VLSI for High-Speed Digital Signal Processing

The research supported by this ONR grant has investigated modern high-speed signal processing
system design. It has encompassed a complete spectrum of activities, starting with the
discovery of new signal processing algorithms and continuing through the development of the
most appropriate methods for their realization, including, in particular, the design, layout, and
fabrication of integrated circuits.

The primary project for this grant has been the design and implementation of a new type
of programmable general-purpose digital filter IC. It employs multiple processing units on a
single integrated circuit. The multiple processors operate in parallel and communicate with one
another through on-chip dual-access storage register blocks, thereby avoiding the operating-speed
penalty that reading and writing off-chip RAM would incur. The system's
topology has the processors arranged in a ring, with locally shared register blocks between each
adjacent pair of processors. Our prototype IC has five processors, and it is capable of realizing a
rich variety of filter structures that operate at the maximum instruction execution rate possible for
any custom parallel implementation.

A circuit board using four of our ring processor chips was also designed and built. It
demonstrates the IC's capability to perform real-time video processing and high-speed
one-dimensional processing of data. It resides in an IBM PC computer and is accessed through the PC bus.
A custom software package was also written that facilitates the programming of the chips and the
configuration of the circuit board for each of its several operating modes.

In addition to the design of the above-mentioned ring-of-processors IC, we have also
developed a task partitioner, which is a computer program that automatically writes programs for
our ring of parallel processors. It accepts an arbitrary filter's description in net-list form and
calculates the theoretical optimum sampling period for the filter's structure on a multiprocessor
system with P processors. (In our case we of course set P=5.) Our algorithm detects all
multiple-input adders in the desired filter structure and gives the user the option of searching for the
optimum adder sequence to minimize the filter's sampling period. It then determines the
optimum time schedule and optimally distributes the computations over the processors in the ring.
The program's output is a set of programs for the parallel processors that causes them to
implement the desired filter structure. We have found the task partitioner capable of implementing all
practical examples of digital filter structures at the optimum sampling period.

Another project supported by this grant has concerned the design, layout, and fabrication
of a programmable digital signal processor using switchable unit-delays for optimal coefficient
allocation in the implementation of FIR filters. This architecture enables very high-speed
processing (our prototype IC proved capable of implementing FIR filters at data rates of 180
MHz) while avoiding the severe hardware inefficiency that would result from a straightforward
programmable tap implementation of the types that had been reported previously. The
switchable unit-delay not only allows the programming of the number of filter taps and the specific
filter-tap coefficient values, it also provides the capability for programming the optimal allocation of
hardware resources to each filter tap. We fabricated a prototype chip capable of realizing a broad
spectrum of linear-phase FIR filters employing up to 32 taps. It was designed using Mentor
Graphics GDT VLSI CAD tools, and we wrote a silicon compiler in the Genie language to
assemble the chip with parameterized word length and number of taps.

Another project that was supported by this ONR grant concerned an improvement to the
Powell and Chau linear-phase IIR filters. In this work we developed a technique using Jacobian
elliptic functions which, by removing a previous method's double-zero constraint, yields
improved designs of linear-phase IIR filters.

Other research carried out under the auspices of this grant dealt with the design of
two-channel perfect-reconstruction linear-phase FIR filter banks. Two new approaches were
developed for the design of such filter banks: the first formulates the problem as a quadratic
programming problem with linear constraints, and the second as one with nonlinear constraints. We also
developed an optimization technique for the design of multiplierless two-channel linear-phase
FIR filter banks employing canonic signed-digit (CSD) code using the new structures, and another
technique was developed for lattice-structure perfect-reconstruction filter banks with
powers-of-two coefficients.
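
As an illustration of the canonic signed-digit (CSD) code mentioned above, the following minimal
sketch converts an integer coefficient to CSD digits. It is not taken from the grant's software;
the function names to_csd and from_csd are hypothetical, and the algorithm shown is the standard
textbook conversion rather than the optimization technique developed in this research.

```python
def to_csd(value):
    """Return the CSD digits of an integer, least-significant digit first.

    Every digit is -1, 0, or +1, and no two adjacent digits are nonzero.
    """
    digits = []
    n = value
    while n != 0:
        if n % 2 == 0:
            digits.append(0)
        else:
            # Pick +1 or -1 so that the remainder becomes divisible by 4,
            # which forces the next digit to be zero.
            d = 2 - (n % 4)  # n % 4 == 1 gives +1, n % 4 == 3 gives -1
            digits.append(d)
            n -= d
        n //= 2
    return digits


def from_csd(digits):
    """Rebuild the integer from CSD digits (least-significant first)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))
```

For example, 7 is represented as 8 - 1, so a multiplication by 7 needs one subtraction of shifted
operands instead of two additions.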

One further project supported by this ONR grant is worth explicit mention. A silicon
compiler for Recursive Running Sum (RRS) and Simple Symmetric Sharpening (SSS) digital filter
structures was written and used to produce a prototype IC. These structures have been shown by
Adams and Willson to offer significant advantages in prefilter-equalizer type implementations of
FIR filters. The prototype IC was designed to achieve a throughput rate of 175 MHz in 1.2-µm
CMOS.
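
The report does not define the RRS structure at this point. Assuming the usual form, in which the
output is the sum of the most recent L input samples computed recursively as
y[n] = y[n-1] + x[n] - x[n-L], a minimal behavioral sketch looks as follows. The function name and
the zero initial state are assumptions made for illustration only.

```python
from collections import deque


def recursive_running_sum(samples, length):
    """Running sum of the last `length` samples using one add and one subtract per sample."""
    window = deque([0] * length, maxlen=length)  # holds x[n-1] ... x[n-length]
    acc = 0
    out = []
    for x in samples:
        acc += x - window[0]  # add the newest sample, drop the one leaving the window
        window.append(x)      # maxlen discards the oldest entry automatically
        out.append(acc)
    return out
```

The recursion needs no multipliers regardless of the window length, which is the property that
makes such a structure attractive as a prefilter.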

The following students earned Master's degrees with theses or projects supported by this
ONR grant: M. C.P. Chen, M. L. Coulter, H. T. Hung, M. C. Kennedy, K-Y. Khoo, A. Y. Kwentus,
L. T-P. Ying. Four of these students are presently employed in industry and three (Chen, Khoo
and Kwentus) are currently continuing their UCLA studies toward Ph.D. degrees. The research
has been documented in seven journal publications (and in two additional journal publications
currently under review) and 13 conference papers, itemized in the annual reports. One invited
lecture via the UCLA/SNU telelink was presented to Seoul National University, and one patent
application was filed based on the grant's research.

In the following pages we give a thorough description of the Ring-of-Processors IC, the
grant's major project. We also describe in some detail the programmable digital signal processor
using switchable unit-delays for optimal coefficient allocation in the implementation of FIR
filters, the project for which a patent application was filed. Reprints of journal papers discussing the
task partitioner research, the improvement to the Powell and Chau linear-phase IIR filters, and the
perfect-reconstruction linear-phase FIR filter banks are incorporated into this report, as is the brief
description of the project on the implementation of high-speed programmable digital FIR
prefilters, which is the text of the paper awarded third prize in the recent national IC design
competition. (This competition was sponsored by Mentor Graphics, Electronic Design, Hewlett Packard,
Sun Microsystems, and Texas Instruments.) Separate copies of all journal papers and conference
papers have been submitted, and copies of all UCLA master's theses are available from the
principal investigator upon request.

I. System Description.

A  new  programmable  general-purpose  digital  filter  IC  is  described  that  employs  multiple 
processing  units  on  a  single  chip.  The  multiple  processors  operate  in  parallel  and  communicate 
with  one  another  through  on-chip  dual-access  storage  register  blocks.  The  topology  of  the  digital 
filter  chip  has  the  processors  arranged  as  a  ring  with  the  locally  shared  register  blocks  between 
each  adjacent  pair  of  processors,  as  shown  in  Figure  1.  Each  processor  has  its  own  coefficient 
and  program  memory,  program  decode  logic,  and  ALU  with  a  hardware  multiplier.  As  is  shown 
in  [1],  this  ring  of  processors  is  capable  of  realizing  a  rich  variety  of  filter  structures  operating  at 
the  maximum  possible  instruction  execution  rate,  i.e.,  requiring  the  minimum  number  of  program 
steps  per  data  sample  that  can  possibly  be  achieved  for  any  custom  parallel-processing 
implementation. 

Two  digital  filter  processor  ICs  have  been  designed.  The  first  contains  a  single  processor  with 
two  adjacent  dual-port  register  blocks  while  the  second  contains  the  complete  ring  of  five 
processors  with  five  dual-port  register  blocks.  The  first  IC  was  intended  as  a  test  vehicle  to  verify 
the  operation  of  all  the  major  blocks  before  fabricating  the  full  five-processor  ring.  The  two  ICs 
are  pin-for-pin  compatible,  making  the  ring  IC  a  “drop-in”  replacement  for  the  single-processor 
IC  on  test  boards. 


Figure  1  -  Ring-structured  processor  topology. 


A.  Single-Processor  IC 

The single-processor IC contains a processor (consisting of a coefficient RAM, a program RAM,
program decoding logic, and an ALU with a hardware multiplier) as well as clock generation
circuitry, input and output data synchronization and scaling circuitry, microprocessor bus
interface circuitry, and two dual-port register blocks. It is in essence a 1-processor "ring". A
block diagram of this IC is shown in Figure 2.


Figure 2 - Single-processor IC block diagram.

The IC provides for 11-bit input and output data with 16-bit internal data. 12-bit filter coefficients
are stored in an internal coefficient RAM. It also includes an 8-bit microprocessor bus interface
for loading programs and coefficients. The IC contains 24,723 transistors in an area of 14.8 mm^2
(including pads) and was fabricated through MOSIS in a 1.2-µm CMOS N-well process. Testing
results show >50 MHz operation with a 5V supply voltage and 25 MHz operation with a 3V
supply voltage. The IC was designed using the magic CAD tools in a scalable CMOS
technology. Figure 3 shows the chip micrograph.

B.  Ring-Processor  IC 

The ring-processor IC contains five processors and five register blocks interconnected as shown
in Figure 1. Each processor consists of a coefficient RAM, a program RAM, program decoding
logic, and an ALU with a hardware multiplier. The IC also contains clock generation circuitry,
input and output data synchronization and scaling circuitry, and microprocessor bus interface
circuitry. It is pin-for-pin compatible with the single-processor IC.


Figure  3  -  Single-processor  IC  chip  micrograph. 

The core of the Ring-Processor chip has been completed and is currently undergoing extensive
simulation before fabrication. The IC will be fabricated through MOSIS in a 1.2-µm CMOS
N-well process (the same as the single-processor IC). The IC core contains 96,378 transistors in an area
of 43.8 mm^2. SPICE simulations indicate that the IC should operate at data rates > 50 MHz with
a 5V supply voltage. Figure 4 shows a plot of the core layout.


NOTE:  Throughout  this  documentation,  signals  that  are  input  or  output  pins  appear  in  italic 
type  while  signals  internal  to  the  IC  appear  in  regular  type. 


Figure  4  -  Ring-processor  IC  core  layout. 


II. IC Pinout

The Single-Processor IC is packaged in an 84-pin PGA package. The Ring-Processor IC will
also be packaged in an 84-pin PGA package with an identical pinout to that of the
Single-Processor IC. The pinout and I/O signals are given below in Figure 5 and Table 1.

Figure 5 - 84-pin PGA.


Table 1: Ring-Processor and Single-Processor IC Pinout

Pin   Signal           Pin   Signal           Pin   Signal
A1    prog             C11   hold_clk         J2    data[6]
A2    pmux[0]          D1    addr[1]          J5    X[10]
A3    reset            D2    GND              J6    VDD
A4    Y_clk            D10   VDD              J7    X[6]
A5    [not legible]    D11   clk_bypass       J10   GND
A6    Y[10]            E1    addr[3]          J11   X[0]
A7    Y[5]             E2    VDD              K1    data[7]
A8    Y[3]             E3    data_strobe      K2    VDD
A9    Y[1]             E9    clk              K3    data[5]
A10   Y[0]             E10   phi2_in          K4    GND
A11   VDD              E11   GND              K5    data[0]
B1    load_sync        F1    proc[2]          K6    data[1]
B2    VDD              F2    addr[2]          K7    X[7]
B3    pmux[1]          F3    GND              K8    X[5]
B4    GND              F9    VDD              K9    X[2]
B5    Y[9]             F10   out_clk          K10   VDD
B6    Y[6]             F11   [not legible]    K11   X[1]
B7    Y[4]             G1    proc[1]          L1    GND
B8    Y[2]             G2    proc[0]          L2    data[4]
B9    VDD              G3    VDD              L3    data[3]
B10   GND              G9    ext_inclk        L4    data[2]
B11   phi1_out         G10   GND              L5    GND
C1    addr[0]          G11   scale            L6    X[9]
C2    GND              H1    GND              L7    X[8]
C5    VDD              H2    chip_select      L8    GND
C6    [not legible]    H10   VDD              L9    X[4]
C7    GND              H11   ext_inclk_byp    L10   X[3]
C10   phi2_out         J1    coeff            L11   VDD


III.  I/O  Signal  Description. 

Following  is  a  list  of  the  input  and  output  pins  and  their  functions. 

A.  Input  Data. 

X[10:0] input

These 11 pins provide the X input data to the ring-of-processors.

ext_inclk input

This input is used to externally clock the X-input data into the chip (see Section V.). It can be
bypassed using ext_inclk_byp. When used, the input data is clocked into the chip on the rising
edge of ext_inclk.

ext_inclk_byp input

This active high input is used to bypass the external input clock for the X-input data (see
Section V.). When this pin is set high, the external input clock ext_inclk is bypassed. When this
pin is set low, the external input clock ext_inclk is used to clock the X-input data into the chip.

B.  Output  Data. 

Y[10:0] output

These 11 pins provide the Y data output from the ring-of-processors.

Y_clk output

This is the output data clock. An active-high pulse of width equal to one phi2 pulse width will
be output when the output data Y changes.

C.  System  Clocking  (see  Section  IV.) 

clk input

This input is the system clock input. All internal timing is derived from this input clock, unless
bypassed using the clk_bypass input. Two-phase non-overlapping clocks are generated
internally from this input. Each processor executes one instruction for every cycle of clk.

phi1_in input

This input is used in conjunction with phi2_in and clk_bypass to provide external two-phase
non-overlapping clocks to the chip.

phi2_in input

This input is used in conjunction with phi1_in and clk_bypass to provide external two-phase
non-overlapping clocks to the chip.

clk_bypass input

This active high input is used to bypass the internal two-phase non-overlapping clock
generator. When this pin is set low, the clk input is used to generate internal two-phase
non-overlapping clocks (phi1 and phi2). When this pin is set high, the clock generator is bypassed
and the phi1_in and phi2_in pins are used to provide the two-phase non-overlapping clocks
used internally.

hold_clk input

This active low input is used to turn off all the internal clocks when loading programs. It can
also be used to conserve power when the chip is in a standby mode. When this pin is set low,
all internal clocks are turned off. When this pin is set high, all internal clocks are turned on for
normal operation. Note: this pin must be set low (i.e., all internal clocks off) when loading
programs into the program memory.

out_clk output

This output is a buffered copy of the system input clock clk. It is primarily intended for
diagnostic purposes.

phi1_out output

This output is a buffered copy of the internal phi1 clock.

phi2_out output

This output is a buffered copy of the internal phi2 clock.

D. System Programming (see Sections VII. and VIII.)

data[7:0] input

These 8 bits are the program data bus used to load program and coefficient data for the internal
processors.

addr[3:0] input

These 4 bits are used to select which program address or coefficient address receives the
information on the program data bus. These bits are also used to load each processor's internal
reset address register.

data_strobe input

This active low input is used to strobe the data on the program data bus into the chip.

chip_select input

This active low input is used to select the chip for programming information. This input does
not affect the normal operation of the chip.

proc[2:0] input

These 3 input pins are used to select which processor receives the programming information
on the program data bus.

prog input

This active high input is used to select the mode of operation in which programs are loaded into
the chip. When prog is active, coeff and scale should both be inactive.

coeff input

This active high input is used to select the mode of operation in which coefficients are loaded
into the chip. When coeff is active, prog and scale should both be inactive.

scale input

This active high input is used to select the mode of operation in which the input and output scale
values are loaded into the chip. When scale is active, prog and coeff should both be inactive.

load_sync input

This active high input is used to synchronize the program address addr[3:0] with the internal
clocks. When this pin is set high, the program address inputs are synchronized to the internal
clocks. When this pin is set low, the program address inputs are not synchronized to the internal
clocks. This pin must be set low when programming the internal processors because all internal
clocks should be disabled using hold_clk at this time. However, during normal operation, this
pin must be set high to ensure proper synchronization when changing the program reset
address.

pmux[1:0] input

These two inputs select which part of the program word or coefficient word the 8-bit program
data bus will be written to. When loading programs onto the chip (i.e., prog set high), these
two inputs select which 8 bits of the 32-bit program word will be written. When loading
coefficients onto the chip (i.e., coeff set high), pmux[1] selects either the 1X or 3X coefficient
and pmux[0] selects the LSB or MSB part of the 13-bit coefficient word.

reset input

This active low input is used to reset every internal processor's program counter to the address
stored in its internal reset address register.


IV.  System  Clocking 

The  system  operates  with  a  two-phase  non-overlapping  clocking  scheme.  A  two-phase  clock 
generator  is  provided  on-chip  to  allow  for  a  single  input  clock.  Alternatively,  the  IC  can  be 
configured  to  operate  with  two  non-overlapping  clock  inputs.  Also,  a  hold_clk  signal  is  provided 
to  disable  all  internal  clocks  for  loading  programs  or  for  reducing  power  during  standby  periods. 
The clock control circuitry and timing are shown in Figure 6.

Table 2: Clock Modes

hold_clk   clk_bypass   Clock Mode                                    Clock Input
1          0            single clock input, clock generator enabled   clk
1          1            two clock inputs                              phi1_in, phi2_in


Figure  6  -  Clock  control  circuitry  and  timing. 


V.  Input  Data  Synchronization  and  Scaling 

Figure 7 shows the block diagram for the input data synchronization and scaling block of Figure
2. The input data is synchronized to the internal phi1 clock before being passed to the ALU input
to ensure that it will be available for input to the ALU at the same time that the other inputs are
available. Additionally, an external input data clock (ext_inclk) is provided for systems that
operate synchronously or for applications using a shared data bus. This external input data clock
can be bypassed if not used. The input register clocked by ext_inclk operates as a rising-edge
triggered flip-flop.


Table 3: Input Modes

ext_inclk_byp   Input Mode
0               external data clock (ext_inclk) enabled
1               external data clock (ext_inclk) disabled


Figure 7 - Input data synchronization and scaling.


The scale block shown in Figure 7 selects which of the 16 internal bits the 11 input bits will be
placed into (with sign extension when appropriate). This allows the input to be shifted by up to
5 bits to the right (i.e., scaled down by a factor of up to 32).

Table 4: Scale Select vs. ALU X-input Data

Scale  16-bit X input to the ALU (most significant bit first)
000    X[10] X[9]  X[8]  X[7]  X[6]  X[5]  X[4]  X[3]  X[2]  X[1]  X[0]  0     0     0     0     0
001    X[10] X[10] X[9]  X[8]  X[7]  X[6]  X[5]  X[4]  X[3]  X[2]  X[1]  X[0]  0     0     0     0
010    X[10] X[10] X[10] X[9]  X[8]  X[7]  X[6]  X[5]  X[4]  X[3]  X[2]  X[1]  X[0]  0     0     0
011    X[10] X[10] X[10] X[10] X[9]  X[8]  X[7]  X[6]  X[5]  X[4]  X[3]  X[2]  X[1]  X[0]  0     0
100    X[10] X[10] X[10] X[10] X[10] X[9]  X[8]  X[7]  X[6]  X[5]  X[4]  X[3]  X[2]  X[1]  X[0]  0
101    X[10] X[10] X[10] X[10] X[10] X[10] X[9]  X[8]  X[7]  X[6]  X[5]  X[4]  X[3]  X[2]  X[1]  X[0]
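
The input scaling described above can be modeled in a few lines. This is a behavioral sketch only,
not the chip's circuitry: it assumes that scale code 000 places X[10] in the most significant of the
16 internal bits and that each increment of the scale code shifts the input one bit to the right
with sign extension, as Table 4 suggests. The function name scale_input is hypothetical.

```python
def scale_input(x, scale):
    """Model of the input scale block.

    x     -- 11-bit two's-complement input as a signed integer (-1024..1023)
    scale -- 3-bit scale select, 0..5 (number of right shifts)
    Returns the 16-bit internal word as an unsigned integer (0..0xFFFF).
    """
    if not -1024 <= x <= 1023:
        raise ValueError("x must fit in 11 bits")
    if not 0 <= scale <= 5:
        raise ValueError("scale must be in 0..5")
    # Left-justify in 16 bits, then shift right; Python's >> on a negative
    # integer is an arithmetic shift, which provides the sign extension.
    word = (x << 5) >> scale
    return word & 0xFFFF
```

With scale = 5 the input occupies the 11 least-significant internal bits, i.e. it has been scaled
down by a factor of 32 relative to scale = 0.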

VI. Output Data Scaling

Figure 8 shows the block diagram for the output data scaling block of Figure 2. The scale block
shown in the figure selects which of the internal 16 bits will be output to the 11 output pads
(Y[10:0]). This allows for the output to be shifted to the left by up to 5 bits (i.e., scaled up by a
factor of up to 32). Note that there is no overflow protection. Care must be taken to ensure that
the output value (in two's complement form) does not overflow, because the resulting errors
could be substantial.


Figure 8 - Output data scaling.


Table 5: Scale Select vs. Y Output Data

Scale  11-bit Y output to pads (internal bits, most significant first)
000    Y[15] Y[14] Y[13] Y[12] Y[11] Y[10] Y[9]  Y[8]  Y[7]  Y[6]  Y[5]
001    Y[14] Y[13] Y[12] Y[11] Y[10] Y[9]  Y[8]  Y[7]  Y[6]  Y[5]  Y[4]
010    Y[13] Y[12] Y[11] Y[10] Y[9]  Y[8]  Y[7]  Y[6]  Y[5]  Y[4]  Y[3]
011    Y[12] Y[11] Y[10] Y[9]  Y[8]  Y[7]  Y[6]  Y[5]  Y[4]  Y[3]  Y[2]
100    Y[11] Y[10] Y[9]  Y[8]  Y[7]  Y[6]  Y[5]  Y[4]  Y[3]  Y[2]  Y[1]
101    Y[10] Y[9]  Y[8]  Y[7]  Y[6]  Y[5]  Y[4]  Y[3]  Y[2]  Y[1]  Y[0]
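
Because the output scale block has no overflow protection, it can be useful to check a filter design
in software before loading it. The sketch below models the bit selection of Table 5 and, as an
addition that the chip itself does not provide, reports whether the selected 11-bit field loses
significant upper bits. The function name scale_output and the overflow flag are hypothetical.

```python
def scale_output(word, scale):
    """Model of the output scale block.

    word  -- 16-bit internal value as an unsigned integer (0..0xFFFF)
    scale -- 3-bit scale select, 0..5 (number of left shifts)
    Returns (y, overflow): y is the 11-bit pad value as an unsigned integer,
    and overflow is True when the discarded upper bits are not pure sign extension.
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must fit in 16 bits")
    if not 0 <= scale <= 5:
        raise ValueError("scale must be in 0..5")
    low = 5 - scale                      # index of the internal bit driving Y pad 0
    y = (word >> low) & 0x7FF            # the 11 selected bits
    signed = word - 0x10000 if word & 0x8000 else word
    overflow = not -1024 <= (signed >> low) <= 1023
    return y, overflow
```

A design whose worst-case internal value makes overflow True at the chosen scale setting would
produce wrapped two's-complement output on the actual pads.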

VII.  Coefficient  Memory 

The coefficient memory is a static RAM block that stores 16 coefficients for input to the
multiplier in the ALU. Each coefficient consists of a 13-bit 1X value, a 13-bit 3X value, and a
1-bit Shift that controls the multiplier output shift multiplexer in the ALU (see Section IX. for
more information on the ALU architecture and the multiplier encoding scheme). The output shift
provides for coefficients in the range -2 < c < 2, allowing for the implementation of the
feedback multipliers in a second-order direct form II filter. The coefficient RAM is loaded
through the 8-bit microprocessor bus interface. The coefficient loading is controlled by the input
signals coeff, pmux[1:0], addr[3:0], data_strobe, and din[7:0]. The coefficient RAM output is
controlled by a 4-bit read address supplied by the program memory (see Section VIII.). A block
diagram of the coefficient RAM is shown in Figure 9. Figure 10 shows the write timing.

Figure 9 - Coefficient memory block diagram.


Table 6: Coefficient Memory Input Multiplexing

pmux[1:0]  din[7]    din[6]    din[5]    din[4]   din[3]   din[2]   din[1]   din[0]
00         IN1X[4]   IN1X[3]   IN1X[2]   IN1X[1]  IN1X[0]  -        -        Shift*
01         IN1X[12]  IN1X[11]  IN1X[10]  IN1X[9]  IN1X[8]  IN1X[7]  IN1X[6]  IN1X[5]
10         IN3X[4]   IN3X[3]   IN3X[2]   IN3X[1]  IN3X[0]  -        -        -
11         IN3X[12]  IN3X[11]  IN3X[10]  IN3X[9]  IN3X[8]  IN3X[7]  IN3X[6]  IN3X[5]

*This bit controls the Multiplier Output Shift Multiplexer in the ALU.
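
The byte layout of Table 6 can be captured in a small host-side helper. The sketch below is an
illustration under stated assumptions, not part of the software package written for the board: the
function name pack_coefficient is hypothetical, the 1X and 3X fields are taken as already-encoded
13-bit values (their encoding is described in Section IX. and is not reproduced here), and unused
din bits are driven to zero.

```python
def pack_coefficient(in1x, in3x, shift):
    """Return {pmux: byte} for one coefficient, following Table 6.

    in1x, in3x -- 13-bit field values as unsigned integers (0..0x1FFF)
    shift      -- multiplier output shift bit (0 or 1)
    """
    if not (0 <= in1x <= 0x1FFF and 0 <= in3x <= 0x1FFF):
        raise ValueError("in1x and in3x must fit in 13 bits")
    if shift not in (0, 1):
        raise ValueError("shift must be 0 or 1")
    return {
        0b00: ((in1x & 0x1F) << 3) | shift,  # IN1X[4:0] on din[7:3], Shift on din[0]
        0b01: (in1x >> 5) & 0xFF,            # IN1X[12:5] on din[7:0]
        0b10: (in3x & 0x1F) << 3,            # IN3X[4:0] on din[7:3]
        0b11: (in3x >> 5) & 0xFF,            # IN3X[12:5] on din[7:0]
    }
```

Each of the four bytes would then be written with coeff high, the coefficient number on addr[3:0],
the matching pmux[1:0] code, and a data_strobe pulse.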

Figure 10 - Coefficient memory write timing.


VIII.  Program  Memory 

The program memory is a static RAM block that stores up to 16 instructions. Each instruction is
32 bits wide, where the bits control the ALU data path (see Section IX.) and provide read and
write addresses to the register blocks and coefficient RAM. Each processor has its own
independent program memory. Program instructions are loaded through the 8-bit microprocessor
bus interface. Two multiplexer control signals (pmux[1:0]) are used to select which 8-bit byte
within the 32-bit instruction is being written. The instructions can be written randomly, but are
read out sequentially. The address counter is incremented on the rising edge of phi1 and the
instruction word is latched at the output of the RAM block on the rising edge of phi2. When reset
(either internally or externally), the program address counter is forced to the stored reset address.
The reset address can be changed at any time during operation through the microprocessor bus
interface. Thus, multiple programs can be loaded and switched between during operation. For
example, adaptive filters can be realized by programming two copies of the filter using different
coefficient addresses. While the first copy of the filter is being run, the coefficients for the second
copy can be updated from off-chip. Once updated, the coefficients can be "switched in" by
changing the reset address to the start of the second program. Once switched, the coefficients
associated with the first program can be updated. Figure 11 shows the write timing for loading
program instructions. Figure 12 shows the reset timing, and Figure 13 shows the timing for
changing the reset address during normal operation.

Table 7: Program Instruction Bit Functions

Bit             Name            Function
31              Reset           Program Reset
                                  0     No Reset
                                  1     Reset
30-27           regRW           Right Register Block Write Address
26-23           regLW           Left Register Block Write Address
22              selR            Right Output Bus Multiplexer Control (ALU)
                                  0     Operand #1 (Op1)
                                  1     Adder Output (Sum)
21              selL            Left Output Bus Multiplexer Control (ALU)
                                  0     Operand #1 (Op1)
                                  1     Adder Output (Sum)
20              selB            Add / Subtract (ALU)
                                  0     Add
                                  1     Subtract
19-18           selA            A Input to Adder - Multiplexer Control (ALU)
                                  00    Multiplier
                                  01    Op1
                                  10    Zero
                                  11    NOT USED / INVALID
17-14           regRR           Right Register Block Read Address
13-10           regLR           Left Register Block Read Address
[not legible]   [not legible]   Operand #2 Multiplexer Control (ALU)
                                  000   X Input Data
                                  001   Right Register Block's Output (RR)
                                  010   Left Register Block's Output (LR)
                                  011   Right Processor's ALU Output (RA)
                                  100   Left Processor's ALU Output (LA)
                                  101   Same Processor's ALU Output (A)
                                  110   Zero
                                  111   NOT USED / INVALID
[not legible]   [not legible]   Operand #1 Multiplexer Control (ALU)
                                  000   X Input Data
                                  001   Right Register Block's Output (RR)
                                  010   Left Register Block's Output (LR)
                                  011   Right Processor's ALU Output (RA)
                                  100   Left Processor's ALU Output (LA)
                                  101   Same Processor's ALU Output (A)
                                  110   Zero
                                  111   NOT USED / INVALID
[not legible]   C-RA            Coefficient Read Address
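
The instruction fields of Table 7 can be extracted from a 32-bit program word with simple shifts and
masks. The sketch below covers only the fields whose bit positions are legible in the table (bits 31
down to 10); the positions of the operand multiplexer controls and the coefficient read address are
deliberately left out rather than guessed. The dictionary FIELDS and the function decode are
hypothetical names, and the name regRW for bits 30-27 is our reading of the table by analogy with
regLW.

```python
# Field name -> (most significant bit, least significant bit), from Table 7.
FIELDS = {
    "Reset": (31, 31),
    "regRW": (30, 27),
    "regLW": (26, 23),
    "selR": (22, 22),
    "selL": (21, 21),
    "selB": (20, 20),
    "selA": (19, 18),
    "regRR": (17, 14),
    "regLR": (13, 10),
}


def decode(word):
    """Split a 32-bit program word into the fields listed in FIELDS."""
    fields = {}
    for name, (msb, lsb) in FIELDS.items():
        width = msb - lsb + 1
        fields[name] = (word >> lsb) & ((1 << width) - 1)
    return fields
```

A decoder of this kind is convenient for checking the output of a program generator against the
intended schedule before the bytes are written to the chip.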


Figure  11  -  Program  memory  write  timing. 

Figure 12 - Program memory reset timing.


Table 8: Program Memory Input Multiplexing

pmux[1:0]  din[7]    din[6]    din[5]    din[4]    din[3]    din[2]    din[1]    din[0]
00         prog[7]   prog[6]   prog[5]   prog[4]   prog[3]   prog[2]   prog[1]   prog[0]
01         prog[15]  prog[14]  prog[13]  prog[12]  prog[11]  prog[10]  prog[9]   prog[8]
10         prog[23]  prog[22]  prog[21]  prog[20]  prog[19]  prog[18]  prog[17]  prog[16]
11         prog[31]  prog[30]  prog[29]  prog[28]  prog[27]  prog[26]  prog[25]  prog[24]

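
Table 8 amounts to writing the 32-bit program word least-significant byte first, with pmux[1:0]
acting as the byte index. A minimal host-side sketch, with the hypothetical function name
program_word_bytes, is:

```python
def program_word_bytes(word):
    """Return [(pmux, byte), ...] for one 32-bit program word, following Table 8."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("word must fit in 32 bits")
    # pmux = 0 carries prog[7:0], pmux = 3 carries prog[31:24].
    return [(pmux, (word >> (8 * pmux)) & 0xFF) for pmux in range(4)]
```

Each pair would be presented on pmux[1:0] and the program data bus with prog high and the
instruction address on addr[3:0], followed by a data_strobe pulse.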

The addr[3:0] bus used to select the program or coefficient to be loaded can be synchronized to
the internal clocks using the load_sync signal. This must be done when loading coefficients or
changing the reset address while the chip is in normal operation. When initially loading
coefficients or programs, the internal clocks are typically turned off using the hold_clk signal.
During this mode of operation, the addr[3:0] bus synchronization register must be bypassed
using the load_sync signal. Figure 14 shows a block diagram of the addr[3:0] bus
synchronization circuitry.


Table  9:  addr[3:0]  Bus  Synchronization  Control 


load_sync   MUX Output
0           addr[3:0] (directly from pads)
1           synchronization register output


Figure  14  -  addr[3:0]  bus  synchronization. 


The outputs of the program memory are pipelined to match the delay through the ALU (see Section IX). Thus, the register block read and write addresses for a given operation are stored within the same program word even though the actual read and write operations occur two clock cycles apart due to the pipeline delay through the ALU. Figure 15 shows the program memory output pipelining.


2  =  Pipeline  Register  Clocked  by  phi2 


Figure  15  -  Program  memory  output  pipelining. 
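
The effect of the output pipelining can be illustrated with a small behavioral model. This is only a sketch: it assumes a fixed two-cycle delay on the write address (taken from the "two clock cycles apart" statement above), and the function name and the use of `None` for "no write yet" are illustrative choices.

```python
from collections import deque


def schedule(program, alu_latency=2):
    """Return (cycle, read_addr, write_addr) tuples as seen by the register block.

    program is a list of (read_addr, write_addr) pairs, one per program word.
    The write address is delayed by alu_latency cycles; None means that no
    write address has reached the register block yet.
    """
    pipe = deque([None] * alu_latency)  # models the pipeline registers
    events = []
    for cycle, (rd, wr) in enumerate(program):
        pipe.append(wr)
        events.append((cycle, rd, pipe.popleft()))
    return events
```

With this model, the write address stored in program word 0 appears at the register block in cycle 2, when the result of the operation that read its operands in cycle 0 is available.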


IX.  ALU 


The ALU is the “heart” of the processor, where all arithmetic computations are performed. Each processor’s ALU contains an 11-bit by 11-bit hardware multiplier, a 16-word coefficient memory that provides one input to the multiplier (see Section VII), a 16-bit adder/subtracter, and several multiplexers that control the data flow within the ALU. The ALU is pipelined so that the multiply operation occurs in one clock cycle. Thus, the ALU performs a multiplication and an addition simultaneously every clock cycle. All data inputs to the ALU are latched in at the rising edge of phi2, and the ALU outputs are latched out at the rising edge of phi2. The multiplexer control signals that control the data flow within the ALU are provided to the ALU by the program memory at the rising edge of either phi1 or phi2 (see Section VIII). A block diagram of the ALU is shown in Figure 16.

The operation of the ALU is as follows. First, two 7-to-1 multiplexers select the two input operands (Op1 and Op2). Each operand is independently selected from the X input data (X), the right register block’s output (RR), the left register block’s output (LR), the right adjacent processor’s ALU output from the previous clock cycle (RA), the left adjacent processor’s ALU output from the previous clock cycle (LA), the current processor’s ALU output from the previous clock cycle (A), or zero (Z). The multiplexer control signals (selOp1 and selOp2) are provided by the program memory a short time after the rising edge of phi2 (see Section VIII). The multiplexer outputs are latched into the ALU at the rising edge of phi2, as shown in Figure 16.

The Op1 input is then truncated to 11 bits and provided as one input to the multiplier. The second input to the multiplier is provided by the coefficient RAM (see Section VII). The coefficient RAM is a 16-word static RAM that stores a 13-bit 1X value (the coefficient value) and a 13-bit 3X value (3 times the coefficient value). The coefficient memory provides an additional bit used to control a multiplexer at the output of the multiplier. This multiplexer allows the multiplier output to be shifted to the left by 1 bit (i.e., multiplied by 2) if desired. This capability is used to implement coefficients in the range -2 < c < 2 for the feedback multipliers in a second-order direct form II IIR filter. In order to achieve high-speed operation in a small chip area, the multiplier was designed to take advantage of both 1X and 3X inputs. For a more detailed discussion of the multiplier refer to [2]. The coefficient memory read address is provided by the program memory shortly after the rising edge of phi2 (see Section VIII). The coefficient memory outputs are latched into the multiplier on the rising edge of phi2, as shown in Figure 16. Since the multiplier’s inputs and outputs are latched at the rising edge of phi2, the multiplier has the entire clock cycle to perform its computation.

The A input to the adder is supplied by the selA multiplexer (see Figure 16). It selects either Operand #1, the multiplier’s output, or zero as input to the adder. The B input to the adder is either Operand #2 or the one’s complement of Operand #2 (i.e., all bits inverted), selected by the selB multiplexer. For an addition operation, Operand #2 is selected. For a subtraction operation, the one’s complement of Operand #2 is selected and the two’s complement is formed by feeding the selB control signal into the adder’s carry input. The adder input multiplexer control signals (selA and selB) are provided by the program memory at the rising edge of phi2 (see Section VIII).

The ALU provides outputs to both the left and right register blocks. These outputs can be individually selected as either the adder’s output or Operand #1, as shown in Figure 16. The ALU’s output is latched at the rising edge of phi2. The output select multiplexer control signals (selL and selR) are provided by the program memory at the rising edge of phi1 (see Section VIII).


Additionally,  the  adder’s  output  is  made  available  as  an  input  to  both  the  left  adjacent  processor 
and  the  right  adjacent  processor.  Thus,  both  processors  have  access  to  the  result  for  use  in  the 
next  clock  cycle,  effectively  bypassing  the  dual-port  register  blocks. 
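
The adder stage described above can be sketched as a small behavioral model. This is an illustration only: it assumes 16-bit two's-complement wraparound, takes the multiplier output as an already computed integer, and uses string names for the selA choices that do not correspond to the chip's actual control encoding.

```python
MASK16 = 0xFFFF


def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def alu_add(a, op2, sel_b):
    """Adder stage: sel_b = 0 adds op2, sel_b = 1 subtracts it.

    For subtraction the B input is the one's complement of op2, and sel_b
    itself drives the carry input, which completes the two's complement.
    """
    b = (~op2 if sel_b else op2) & MASK16
    return to_signed16((a & MASK16) + b + sel_b)


def alu_step(op1, op2, product, sel_a, sel_b):
    """One adder cycle; sel_a picks 'op1', 'mult', or 'zero' for the A input."""
    a = {"op1": op1, "mult": product, "zero": 0}[sel_a]
    return alu_add(a, op2, sel_b)
```

Selecting zero on the A input with `sel_b = 1` negates Operand #2, and selecting the multiplier output gives the multiply-accumulate step used by each filter tap.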




Figure  16  -  ALU  block  diagram. 


X.  Dual-Port  Register  Block 

The dual-port register block is a 16-word by 16-bit dual-port static RAM. Each dual-port RAM block is connected between two processors, providing simple interprocessor communication. The register block has separate read and write data and address buses for each processor. Thus, each processor’s access to the register block is completely independent. Automated programming techniques described in [1] are used to ensure that both processors do not write to the same memory location at the same time. The register block’s read addresses are provided by the program memory at the rising edge of phi2. The register block’s write addresses are provided by the program memory at the rising edge of phi1 (see Section VIII). The register block’s input data is provided by the ALU’s output at the rising edge of phi2. The register block’s output data is provided as input to the ALU. The ALU’s input registers latch the data at the rising edge of phi2 (see Section IX). The input data is written into the selected storage location during the time when phi2 is high. Since the write address is provided at the rising edge of phi1, the decoder outputs are stable before the write cycle begins. Separate input and output data buses are used to allow for high-speed operation. Figure 17 shows a block diagram of the dual-port register block.

Figure 17 - Block diagram of the dual-port register block.
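
A minimal behavioral sketch of the register block is given below. It is not a description of the circuit: the class and method names are hypothetical, and the exception raised on a write conflict stands in for the check that the programming tools of [1] are said to perform before a program is ever loaded.

```python
class DualPortRegisterBlock:
    """16-word by 16-bit storage shared by two neighbouring processors."""

    def __init__(self, words=16, width=16):
        self.mask = (1 << width) - 1
        self.mem = [0] * words

    def read(self, addr):
        """Either port may read any location independently."""
        return self.mem[addr]

    def write_cycle(self, write_a=None, write_b=None):
        """Apply at most one write per port; each write is (addr, data) or None."""
        if write_a and write_b and write_a[0] == write_b[0]:
            # Assumed policy for this sketch: treat the conflict as a programming error.
            raise ValueError("both ports write the same location")
        for write in (write_a, write_b):
            if write is not None:
                addr, data = write
                self.mem[addr] = data & self.mask
```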


XI.  Internal  System  Timing 

The ring-processor system operates with a two-phase non-overlapping clock scheme (phi1 and phi2), as discussed in Section IV. Detailed timing information for all the major blocks within the system follows.


[Timing diagram signals: Operand Latch, Mult Out Latch, Sel A, Sel B / Carry, Sel L/R Latch, Sel L, Sel R, ALU Out Latch]


XII.  IC  Testing  Results 

During the course of this project, several ICs were designed and fabricated to test the major blocks in the ring-processor system before fabricating the complete five-processor system. Several of the test ICs were fabricated as MOSIS TinyChips because of the low price of that option. A TinyChip is a special option available from MOSIS for 40-pin ICs of a prescribed size (2.25 mm by 2.22 mm including pads, with the pads in known locations) fabricated in 2-µm CMOS technology. The option is offered at a very low price because the known pad locations make packaging easier, only 4 packaged parts are returned, and no chip micrograph is taken. All ICs were tested using a Tektronix LV500 IC tester. Although the LV500 tester can only generate input test patterns at a maximum clock rate of 50 MHz (i.e., a 20 ns cycle), it can control transition edges within a given clock cycle in 0.5 ns increments. Thus, it is possible to test circuits that operate at a clock rate higher than 50 MHz by surrounding the test circuit with additional input and output registers that are clocked by separate clocks, and then adjusting the timing between the two clocks within a given 20 ns LV500 test pattern cycle. This test methodology was adopted for most of the ICs described below. Following is a brief discussion of each of the test ICs fabricated and the testing results.
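
The two-clock methodology amounts to sweeping the spacing between the input-register clock and the output-register clock until the captured result is correct. The sketch below illustrates that search under stated assumptions: `passes` is a hypothetical callback standing in for one tester run, and the sweep limits are illustrative defaults rather than LV500 settings.

```python
def min_passing_delay(passes, start_ns=5.0, stop_ns=20.0, step_ns=0.5):
    """Return the smallest launch-to-capture spacing (ns) for which passes() is true.

    passes is a hypothetical callback representing one tester run with the
    output-register clock placed that many nanoseconds after the input clock.
    Returns None if no spacing in the swept range passes.
    """
    steps = int(round((stop_ns - start_ns) / step_ns))
    for i in range(steps + 1):
        delay = start_ns + i * step_ns
        if passes(delay):
            return delay
    return None
```

Because the edges move in 0.5 ns increments, a circuit whose true delay is 16.2 ns would be reported as 16.5 ns by such a sweep.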

A.  Dual-Port  Register  Block  Test  IC 

This IC was fabricated to test the dual-port register block. It contains a 16-word by 16-bit dual-port RAM block, input and output data registers, and read and write address registers. All registers, included for testing purposes, are clocked independently to facilitate accurate measurement of the read and write timing, as discussed previously. Due to the pad limitations of MOSIS TinyChips, only 3 input bits and 3 output bits were brought out to the pads for observation. Figure 18 shows a block diagram of this test IC. The register block core contains 4,064 transistors in a chip area of 1.66 mm² and was fabricated through MOSIS in a 2-µm CMOS P-well technology (TinyChip). All 4 parts received from MOSIS were fully functional with a worst-case read time of 15 ns and a worst-case write time of 16.5 ns.


Table  10:  Dual-Port  Register  Block  IC  Testing  Results 




Figure  18  -  Block  diagram  of  the  dual-port  register  block  test  IC. 


B.  11-bit  by  11-bit  Multiplier  Test  IC 

This IC was fabricated to test and characterize the multiplier used in the ALU. It contains the 11-bit by 11-bit multiplier, the coefficient and input data registers, the output data registers, and RAM to store the coefficient and input data. Figure 19 shows a block diagram of this test IC. The registers and RAM were included to model as accurately as possible the environment that the multiplier would see within the ALU (i.e., loading, drive capability, etc.). Separate input and output register clocks were provided to facilitate accurate testing of the multiplier delay (as discussed above). The multiplier core contains 3,492 transistors in a chip area of 1.53 mm² (1.313 mm by 1.116 mm) and was fabricated through MOSIS in a 2-µm CMOS N-well technology (TinyChip). SPICE simulations indicated a worst-case operating time of 22.5 ns (including the delay of the input registers). All 4 parts received from MOSIS were fully functional with a worst-case operating time of 23 ns. Testing results are given in Table 11 and Figure 20 shows the layout of the IC. For more detailed information about the multiplier refer to [2].


Table  11: 11-bit  by  11-bit  Multiplier  IC  Test  Results 




Figure  19  -  Block  diagram  of  the  11-bit  by  11-bit  multiplier  test  IC. 


Figure  20  -  Layout  for  the  11-bit  by  11-bit  multiplier  test  IC. 


C.  11-bit  by  16-bit  Multiplier  Test  IC 

This IC was fabricated as an extension to the 11-bit by 11-bit multiplier. It uses a 3rd order recoding scheme (as opposed to the 2nd order recoding scheme used in the previous multiplier) to extend the data precision to 16 bits while using the same number of partial products. This is achieved by replacing the 4-to-1 multiplexers used in the 11-bit by 11-bit multiplier with 8-to-1 multiplexers. The test IC includes input data and coefficient registers, RAM to store the coefficient and input data, and output data registers. The input and output data registers are clocked by different clocks to facilitate high-speed testing, as described previously. Figure 21 shows a block diagram of the test IC. For more information on the multiplier refer to [2]. The multiplier core contains 5,035 transistors in a chip area of 0.9 mm² (0.88 mm by 1.05 mm) and was fabricated through MOSIS in a 1.2-µm CMOS N-well technology. SPICE simulations indicated a worst-case operating time of 16 ns (including the register delays). Of the 24 parts received from MOSIS, 20 were found to be fully functional with worst-case operating times ranging from 17.5 ns to 19 ns with a mean of 18.175 ns. Figure 22 shows the test results for 5 V and 3 V supply voltages and Figure 23 shows the chip micrograph.

Figure 21 - Block diagram of the 11-bit by 16-bit multiplier test IC.

Figure 22 - Testing results for the 11-bit by 16-bit multiplier test IC.


Figure  23  -  Chip  micrograph  of  the  11-bit  by  16-bit  multiplier  test  IC. 
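
A rough count suggests why the number of partial products is unchanged. Assuming that an order-r recoding consumes r bits of the recoded operand per partial product (an assumption made here for illustration only; the actual recoding is described in [2]), an n-bit operand needs about $\lceil n/r \rceil$ partial products, so that

$$\left\lceil \tfrac{11}{2} \right\rceil = 6 = \left\lceil \tfrac{16}{3} \right\rceil .$$

Under the same assumption, each partial product is chosen by a selector with $2^r$ inputs, which is consistent with the 4-to-1 multiplexers for r = 2 and the 8-to-1 multiplexers for r = 3.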


D.  Single-Processor  Test  IC 


This IC was fabricated to test the major blocks of the ring-processor system before fabricating the complete five-processor ring. It consists of a single processor and two dual-port register blocks (basically, a one-processor “ring”). The chip provides for 11-bit input and output data with 16-bit internal data and 12-bit coefficients (stored in on-chip memory). The IC also has an 8-bit microprocessor bus interface for loading programs and coefficients. A block diagram of the IC is shown in Figure 2. The IC contains 24,723 transistors in a chip area of 14.8 mm² (3.7 mm by 4.0 mm including pads) and was fabricated through MOSIS in a 1.2-µm CMOS N-well technology. Of the 24 parts received from MOSIS, 19 were fully functional, and all of these operated at a clock rate >50 MHz (the limit of the LV500 IC tester). Figure 24 shows the minimum supply voltage for 50 MHz operation and the minimum clock period for a 3.3 V supply voltage. Due to limitations of the LV500 IC tester, the minimum clock cycle period can only be tested in 4 ns steps.

The single-processor IC was also programmed to implement several different filters. Figure 25 shows the transfer function of a 15-tap lowpass FIR filter which was run on the IC at a 50 MHz instruction clock rate. The filter requires 15 program steps, so the data rate is 3.33 MHz (only 3 steps will be required on the five-processor ring, so the data rate will be 16.67 MHz). Figure 26 shows testing results of the single-processor IC programmed to implement the 15-tap lowpass filter for a two-tone input. The instruction clock rate is 50 MHz, giving a sample rate Fs of 3.33 MHz, and thus the input tones are at 333 kHz (normalized frequency 0.1) and 832.5 kHz (normalized frequency 0.25) respectively. The chip micrograph is shown in Figure 3.

Figure 24 - Single-processor IC testing results.

Figure 25 - 15-tap FIR lowpass filter transfer function.

Figure 26 - Two-tone input and single-processor IC output for the 15-tap lowpass filter.
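
The data rates quoted for the 15-tap filter follow from dividing the instruction clock rate by the number of program steps per output sample. The sketch below reproduces that arithmetic; the function name is illustrative, and the even division of taps among processors is an assumption that matches the 15-tap, five-processor case mentioned in the text.

```python
import math


def data_rate_hz(clock_hz, taps, processors=1):
    """Sample rate when each processor runs ceil(taps / processors) steps per sample."""
    steps = math.ceil(taps / processors)
    return clock_hz / steps


single = data_rate_hz(50e6, 15)             # one processor, 15 steps per sample
ring = data_rate_hz(50e6, 15, processors=5)  # five processors, 3 steps per sample
```

The quoted tone frequencies are the normalized frequencies 0.1 and 0.25 multiplied by the rounded sample rate of 3.33 MHz.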
An Efficient 180 MHz Programmable FIR Digital Filter


1  Introduction 

FIR filtering is without a doubt one of the most important digital signal processing operations. In modern high-speed digital signal processing systems, data-rates of 100 MHz are becoming increasingly common. Implementing FIR filters at such high data-rates often requires the use of dedicated (non-programmable) custom application specific integrated circuits (ASICs). However, programmable FIR filters are required in many applications involving adaptive filtering, and they are often desirable for rapid prototyping, or for use in small volume applications where the cost of custom filter chips may be prohibitive. When implemented efficiently, a programmable filter can also be used instead of a custom FIR filter ASIC with advantages similar to those of FPGAs (i.e., it is an off-the-shelf standard product with no NRE costs and no inventory risk, it facilitates fast time to market, it is factory tested, and it allows design changes anytime). In addition to the increase in data-rates, another trend in high-performance signal processing systems is the increase in data word length. These factors, a longer word length and a higher data-rate, make the efficient implementation of a programmable FIR digital filter very challenging.

The implementation of high-speed programmable FIR digital filters (or correlators) is well researched. Invariably the transposed direct form FIR structure is used, with a separate multiplier for each¹ filter tap (i.e., each sample of the filter’s impulse response). In such an implementation the data-rate is limited only by each filter tap’s delay, which is largely the time required for a multiply and an add operation. The drawback, however, is the large chip area required to accommodate a large number of multipliers. Various methods to reduce the complexity and hence the area of the multipliers have been reported in the literature. In [1] serial multipliers are used, which severely limit the data-rate. In [2] an EPROM storing the products of all possible inputs by all filter coefficients is used in place of the multiplier. However, such intensive chip programming requirements severely limit its use as an adaptive filter. Advances in modern CMOS technology have also made possible a straightforward integration of a large number of standard multipliers on a single chip. For example, [3] reports a programmable filter chip consisting of 40 standard multipliers using 0.9-µm CMOS technology. However, this approach does not scale well with increasing word length since the area complexity of a standard multiplier varies as the square of the word length.

¹ Or a separate multiplier for each pair of samples of the symmetric impulse response of linear-phase filters.

An  effective  method  to  reduce  the  complexity  of  the  multipliers  for  the  case  of  dedicated 
(non-programmable)  FIR  filters  is  to  use  the  canonic  signed-digit  (CSD)  [4-6]  representation 
of  the  coefficient  values.  In  essence,  the  CSD  representation  reduces  the  number  of  coefficient 
digits  needed  to  represent  each  coefficient  value,  which  correspondingly  reduces  the  number 
of  partial  products  produced  when  multiplying  the  input  data  by  the  coefficient  values. 
This  method,  along  with  algorithms  to  design  FIR  filters  with  powers-of-two  coefficients  [6], 
results in very efficient implementations of high-speed dedicated FIR filters. Silicon
compilers  which  produce  the  layout  for  such  dedicated  FIR  filter  chips  using  CSD  coefficient 
representation  are  also  readily  available  [7].  This  approach,  however,  cannot  be  readily 
adapted  to  a  programmable  structure  because  neither  the  number  of  CSD  coefficient  digits 
nor  the  position  of  the  individual  CSD  coefficient  digits  is  known  prior  to  programming. 

In  this  paper  we  describe  an  effective  solution  to  the  problem  of  using  the  CSD  approach 
for  a  programmable  FIR  filter  (or  correlator)  structure,  and  we  present  [8]  the  first  efficient 
implementation  of  a  programmable  linear-phase  FIR  digital  filter  using  CSD  coefficients.  We 
show  that  it  is  possible  to  achieve  high-speed  processing  while  avoiding  the  severe  hardware 
inefficiency that would result from a straightforward programmable tap implementation [1-3].
In a straightforward implementation many filter-tap “multipliers” would significantly
waste  valuable  computational  resources  since  all  taps  of  a  programmable  structure  would 
need  to  accommodate  “difficult”  coefficient  values,  while  for  any  specific  filter  most  taps 
would  not  require  such  extreme  capabilities.  For  example,  the  taps  whose  coefficient  values 
require  higher  precision  are  often  located  near  the  center  of  the  impulse  response  of  a  typical 
lowpass  FIR  filter. 

Our  approach  not  only  allows  the  programming  of  the  number  of  filter  taps  and  the 
specific  filter-tap  coefficient  values,  but  it  also  provides  the  capability  for  programming  the 
optimal  allocation  of  hardware  resources  to  each  filter  tap.  Thus  the  computational  resources 
that  otherwise  might  have  been  wasted  are  made  available  to  further  increase  the  precision  in 
any  tap’s  coefficient  representation,  or  for  use  in  implementing  a  larger  number  of  filter  taps. 
We  have  achieved  these  unique  advantages  in  our  design  by  developing  a  novel  switchable 
unit-delay.  We  have  verified  the  ideas  in  a  prototype  chip  that  is  capable  of  implementing 
a  broad  spectrum  of  linear-phase  FIR  filters  employing  up  to  32  taps  with  16-bit  input 
and output data, in a die size of 5.9 mm by 3.4 mm using 1.2-µm CMOS technology. The
prototype  chip  has  been  fabricated  through  the  MOSIS  service  and  tested  to  operate  at 
data-rates  as  high  as  180  MHz. 

Section 2 briefly reviews the FIR filter and the signed-digit representation for numbers. It then introduces the programmable unit tap (p-tap) that is the basic element of our new programmable structure. Section 3 describes efficient circuit implementations for the programmable filter structure. Section 4 describes a prototype chip that implements a linear-phase FIR filter employing up to 32 taps with 16-bit input and output data and operating at data-rates as high as 180 MHz. Section 5 shows some design examples illustrating the advantages of our architecture.

2  Programmable  FIR  Filter  Architecture

2.1  Review  of  FIR  Filters  and  Correlators 


The time-domain input-output relation for a causal Finite Impulse Response (FIR) system with impulse response h(n) is given by the convolution formula

$$y(n) = \sum_{k=0}^{M-1} h(k)\,x(n-k) \tag{1}$$

$$\phantom{y(n)} = h(n) * x(n) \tag{2}$$

where M is the length of the filter, M - 1 is the order of the filter, and * denotes the convolution operator. The minimum length M needed to implement a typical low-pass filter response is approximately proportional to the inverse of the normalized transition bandwidth of the filter’s frequency response [9]. Therefore, for a programmable filter to be able to realize filter responses with sharp transition bands, we must allocate as large a number of taps M as possible to the programmable filter.
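
Equation (1) can be evaluated directly with a few lines of code. The following Python sketch does so under the usual assumption that x(n) = 0 for n < 0; the function name is illustrative.

```python
def fir_filter(h, x):
    """Return y(n) = sum_k h(k) x(n-k) for n = 0 .. len(x)-1, taking x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k, hk in enumerate(h):
            if n - k >= 0:  # skip samples before the start of the input
                acc += hk * x[n - k]
        y.append(acc)
    return y
```

Feeding a unit impulse through the function returns the coefficients h(0), ..., h(M-1) in order, which is a convenient sanity check.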

A mathematical operation that closely resembles convolution is correlation. For two signal sequences x(n) and y(n), each of which has finite energy, the crosscorrelation of x(n) and y(n) is a sequence $r_{xy}(n)$ given by

$$r_{xy}(n) = \sum_{k=-\infty}^{\infty} x(k)\,y(k-n) \tag{3}$$

$$\phantom{r_{xy}(n)} = x(n) * y(-n) \tag{4}$$

It is obvious that an FIR filter (convolver) can be used as a correlator by simply reversing the ordering of the sequence y(n) that the input data x(n) is to be correlated with, and using that reversed sequence as the FIR filter coefficients. The system function of the FIR filter is obtained by taking the z transform of (1), which yields

$$H(z) = \sum_{n=0}^{M-1} h(n)\,z^{-n} \tag{5}$$

This can be written as a recursive equation:

$$H(z) = H_0(z) \tag{6}$$

with

$$H_k(z) = \begin{cases} h(k) + z^{-1} H_{k+1}(z) & \text{for } k = 0, \ldots, M-1 \\ 0 & \text{for } k > M-1 \end{cases} \tag{7}$$

Notice that each recurrence of (7) describes a single filter tap. That is, the output of the current tap $H_k(z)$ is the sum of two terms. One is the product of the input data and the filter coefficient h(k), and the other is the output of the previous tap $H_{k+1}(z)$ after passing through a unit delay $z^{-1}$. Implementing H(z) using (6) and (7) directly results in the well-known transposed (or inverted) direct form FIR structure shown in Fig. 1. (The index k in (7) advances from 0 to M - 1 from right to left in Fig. 1.)

Figure 1: Transposed direct-form realization of FIR system.
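
The recurrence (7) maps onto a simple simulation in which every tap multiplies the current input sample and adds the delayed output of the tap before it in the chain. The sketch below is a behavioral illustration only, with illustrative names; it says nothing about how the taps are realized in hardware.

```python
def transposed_fir(h, x):
    """Filter x with coefficients h using the transposed direct-form recurrence."""
    M = len(h)
    state = [0] * M  # state[k] holds the delayed output of tap k+1; state[M-1] stays 0
    y = []
    for sample in x:
        # Tap k output: h(k) * x(n) plus the output of tap k+1 delayed by one sample.
        outputs = [h[k] * sample + state[k] for k in range(M)]
        y.append(outputs[0])
        # Each unit delay captures the output of the tap that feeds it.
        state = outputs[1:] + [0]
    return y
```

Used as a correlator, the structure is loaded with the reversed reference sequence: with reference [1, 2, 3] the coefficients are [3, 2, 1], and the output peaks at the sample where the reference pattern ends in the input.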

2.2  Signed-Digit  Representation 

We use signed-digit representation to specify the filter coefficients. A radix-2 signed-digit fractional number C is represented by

$$C = \sum_{k=0}^{N} c_k\,2^{-k} \tag{8}$$

where $c_k$ is a signed digit in the set {-1, 0, 1}, and C has a word length of N + 1 digits. In general, the signed-digit representation for a given number is not unique. A minimal representation is one that requires the least number of nonzero digits. Among the minimal representations, there exists a unique representation known as the canonic signed-digit (CSD) representation for which no two nonzero digits are adjacent. The advantage of a minimal signed-digit representation such as CSD is that there are fewer nonzero terms in (8), which results in fewer partial products when the number C multiplies another number.
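
A CSD representation can be computed with the standard non-adjacent-form recurrence on the integer obtained by scaling the coefficient. The Python sketch below is one way to do this and is not taken from the paper; the function name and the `frac_bits` parameter (playing the role of N in (8)) are illustrative.

```python
def csd_digits(value, frac_bits):
    """Return digits c_0 .. c_N (c_k has weight 2**-k) of value rounded to frac_bits bits.

    Assumes |value| is small enough that no digit above weight 2**0 is needed.
    """
    n = round(value * (1 << frac_bits))
    digits = []  # least-significant digit first
    while n != 0:
        if n & 1:
            d = 2 - (n % 4)  # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d           # n is now divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        n //= 2
    if len(digits) > frac_bits + 1:
        raise ValueError("value needs a digit above weight 2**0")
    digits += [0] * (frac_bits + 1 - len(digits))
    return digits[::-1]  # index k now has weight 2**-k
```

For example, 0.4375 = 7/16 needs three nonzero bits in ordinary binary (0.0111) but only two signed digits in CSD, since 0.4375 = 2^-1 - 2^-4.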

Algorithms for computing CSD coefficients for FIR filters that meet arbitrary specifications have been developed [6,10,11]. In general, these algorithms seek to limit the number of nonzero digits used to represent each signed-digit fractional coefficient value. That this is feasible in practice is demonstrated by the observation in [6] that only one nonzero digit in the CSD representation is typically required for each 20 dB of stopband attenuation in the filter specification, with an additional nonzero digit allocated to those impulse response coefficients whose magnitude exceeds 1/2. Thus a coefficient can be represented with a limited number of signed digits as

$$C = \sum_{k=0}^{L} c_k\,2^{-p_k} \tag{9}$$

where $c_k$ is a signed digit in the set {-1, 0, 1}, and $p_k \in \{0, 1, \ldots, N\}$; $p_k$ now signifies the position of the signed digit $c_k$. Notice that C can have up to L + 1 nonzero digits and that its effective word length is still N + 1 digits.
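
The form (9) stores each coefficient as a short list of (digit, position) pairs. The sketch below shows one simple way to obtain such a list from a full signed-digit word, by keeping only the most significant nonzero digits. This plain truncation is an assumption made for illustration; it is not the optimization performed by the design algorithms of [6,10,11], and the function names are hypothetical.

```python
def limit_nonzero_digits(digits, max_nonzero):
    """Keep the max_nonzero most significant nonzero digits as (c, p) pairs.

    digits[k] is the signed digit with weight 2**-k, as in (8).
    """
    pairs = [(c, p) for p, c in enumerate(digits) if c != 0]
    return pairs[:max_nonzero]


def coefficient_value(pairs):
    """Evaluate C as the sum of c * 2**-p over the retained digits, as in (9)."""
    return sum(c * 2.0 ** -p for c, p in pairs)
```

With two digits retained (L + 1 = 2), the word 0.5 - 0.125 + 0.015625 = 0.390625 is approximated by 0.5 - 0.125 = 0.375.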


2.3  Programmable  Unit-tap 

The complexity of a programmable FIR filter is determined both by its length and by the number of nonzero digits allocated to each filter tap. As pointed out in the previous two sections, a filter with higher stopband attenuation demands a larger number of nonzero digits for its coefficients, whereas a filter with a sharper transition band demands a larger number of filter taps. Clearly, satisfying both demands will tend to require a chip with an uneconomically large silicon area. Furthermore, either the large number of taps or the large number of coefficient digits would be wasted for filters with wide transition bands or low stopband attenuations, respectively. These wasted resources might otherwise be used to realize a filter with a larger number of taps or coefficient digits, whichever the application requires. The required precision for each coefficient is also non-uniform among all the coefficients. For example, the coefficient values that require higher precision are often near the center of the impulse response of a typical lowpass FIR filter. Furthermore, in the case of a correlator, it is uneconomical to allocate full precision to each tap because the average number of nonzero digits per tap may only be approximately N/3 [12]. In some correlator applications, many taps have zero value.

These difficulties can be overcome by having the number of nonzero digits allocated to each filter tap be one of the aspects of the chip’s programming. This can be achieved by replacing the $z^{-1}$ factor in (7) by a programmable factor $z^{-q_k}$ so that the filter’s transfer function becomes

$$H(z) = H_0(z) \tag{10}$$

with

$$H_k(z) = \begin{cases} C_k + z^{-q_k} H_{k+1}(z) & \text{for } k = 0, \ldots, M-1 \\ 0 & \text{for } k > M-1 \end{cases} \tag{11}$$

where $q_k \in \{0, 1, \ldots, Q\}$, and $C_k$ is represented using (9) with up to L + 1 nonzero digits,

$$C_k = \sum_{i=0}^{L} c_{k,i}\,2^{-p_{k,i}} \tag{12}$$


We call the physical realization of each recurrence of (11) a p-tap to distinguish it from the filter tap in (7). We also call $C_k$ the p-tap coefficient to distinguish it from the filter coefficient $h(k)$ in (7). $L$ should be a small integer such that $C_k$ is a low-precision number, allowing each p-tap to be implemented with minimal silicon area. Hence a large number of p-taps can be realized economically. When $q_k = 1$, the corresponding $C_k$ of (11) is equivalent to the coefficient of an ordinary filter tap. Thus, a long filter that has a sharp transition band can be programmed with low-precision coefficients. When $q_k > 1$, $q_k - 1$ filter taps with zero coefficients are realized by a single p-tap. This is useful for implementing Qth band filters [13, pages 151-157], or for implementing a correlation sequence with many zero-value data. When $q_k = q_{k+1} = \cdots = q_{k+j-1} = 0$ and $q_{k+j} = 1$, the terms $C_k, C_{k+1}, \ldots, C_{k+j}$ are merged to form a single filter tap whose effective coefficient value $h(n)$ is

$$h(n) = \sum_{l=k}^{k+j} C_l \qquad (13)$$


which has $j+1$ times the number of coefficient digits (i.e., precision) of a single p-tap. Thus, a filter with high-precision coefficients, for implementing a large stopband attenuation, can be programmed by trading off the total number of filter taps. Since the $q_k$ are individually programmable, a large variety of filters can be programmed.

Figure 2: A p-tap.
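The way a list of p-tap settings expands into ordinary filter taps can be sketched in a few lines of Python. This is only an illustration of the recurrence (11) and the merging rule (13), not the chip's programming software; the function name `impulse_response` and the example values are hypothetical.

```python
from fractions import Fraction


def impulse_response(coeffs, delays):
    """Expand p-tap settings into the effective filter taps h(n).

    coeffs[k] is the p-tap coefficient C_k and delays[k] is q_k, following
    H_k(z) = C_k + z^(-q_k) * H_{k+1}(z) with H_M(z) = 0.
    """
    if len(coeffs) != len(delays):
        raise ValueError("need one q_k per C_k")
    if not coeffs:
        return []
    taps = {}
    position = 0  # accumulated delay in front of the current p-tap
    for c, q in zip(coeffs, delays):
        taps[position] = taps.get(position, 0) + c  # q = 0 merges neighbours
        position += q                               # q > 1 inserts zero taps
    return [taps.get(n, 0) for n in range(max(taps) + 1)]


# Hypothetical program: two merged p-taps, one plain tap, one skipped tap.
C = [Fraction(1, 4), Fraction(1, 16), Fraction(1, 2), Fraction(1, 8)]
q = [0, 1, 2, 1]
h = impulse_response(C, q)
print(h)
```

Here the first two p-taps merge into one tap of value 5/16 (because $q_0 = 0$), and $q_2 = 2$ leaves one zero-valued tap before the last coefficient.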

An example of a specific realization of a p-tap (the one implemented in our prototype chip) is shown in Fig. 2. In this example, the number of nonzero digits, $L+1$, in each p-tap is 2, and $q_k \in \{0, 1\}$. The choice of having two coefficient digits per tap is partly due to the observation in [14] that the optimal (in efficiency) number of full adder stages between pipeline registers is two. The programmable $z^{-q_k}$ term is implemented by a switchable unit-delay register which is turned on (not bypassed) when $q_k = 1$, and turned off (bypassed) when $q_k = 0$. This is indicated schematically by the dotted line in the figure. When the unit-delay is on, the p-tap operates as a conventional filter tap. When it is off, the summation node is connected directly to the summation node of the next p-tap, merging the coefficient digits for the current p-tap and the next into a single filter coefficient. If the unit-delay of the next p-tap is on, then the current p-tap together with the next p-tap effectively forms a single filter tap that has twice the number of nonzero coefficient digits of a single p-tap. More nonzero coefficient digits can be added by combining additional p-taps in this manner. Fig. 3 illustrates three filter taps programmed to have 2, 4 and 6 coefficient digits.

2.4  Efficient  Coding  of  Coefficients 

In a programmable filter both $c_k$ and $p_k$ must be made programmable over the range $c_k \in \{-1, 0, 1\}$ and $p_k \in \{0, \ldots, N\}$. Notice that the fundamental property of the CSD representation, that no two nonzero digits are adjacent, would allow $p_k$ to be programmed over a more restricted range:

$$p_k \in \{2k,\ 2k+1,\ 2k+2,\ \ldots,\ 2k+N-2L\} \qquad (14)$$

for $k = 0, \ldots, L$. However, by using a full programmable range of $\{0, \ldots, N\}$, we can simplify the hardware required for storing and multiplying the coefficients, as will be shown shortly. Furthermore, if $p_k$ is implemented by a programmable shifter using a series of multiplexors, very little if any silicon area would actually be saved by using the restricted range, due to the disruption of the regularity of the design. The only savings would be the smaller number of multiplexors needed.

Figure 3: Filter taps programmed with different coefficient digits.

In our implementation of the p-tap, as shown in Fig. 2, two coefficient digits are allocated to each p-tap, forming a coefficient:

$$C = c_0 2^{-p_0} + c_1 2^{-p_1}. \qquad (15)$$


Since we permit both $p_0$ and $p_1$ to vary from 0 to $N$, the necessity to allow $c_0$ and/or $c_1$ to be zero can be eliminated by the following simple transformations: (i) If $C = 0$ is required, use:

$$0 = 2^{-p_k} + (-1)\,2^{-p_k}. \qquad (16)$$

(ii) If the coefficient $C$ requires only one nonzero digit $c_k$, we expand it into a two-nonzero-digit equivalent using one of the following representations:

$$c_k 2^{-p_k} = \begin{cases} c_k 2^{-(p_k+1)} + c_k 2^{-(p_k+1)} & \text{when } p_k < N \\ c_k 2^{-(p_k-1)} - c_k 2^{-p_k} & \text{when } p_k > 0. \end{cases} \qquad (17)$$


Thus, the values required for each $c_k$ now become $\{-1, 1\}$ instead of the conventional $\{-1, 0, 1\}$ for the CSD representation. The elimination of the zero value simplifies the hardware
for  coefficient  multiplication  and  reduces  the  storage  requirements  for  the  coefficient  digits 
(a  single  bit,  instead  of  two,  is  now  sufficient  to  represent  each  coefficient  digit  ck).  A  similar 
transformation  can  be  made  to  eliminate  the  zero  digit  for  implementations  of  a  p-tap  with 
more-than-two-digit  coefficients. 
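The recoding of (16) and (17) can be checked with a short sketch. The function names `recode` and `value`, and the choice $N = 15$, are assumptions made for illustration only; the text states only that each shift ranges over 0 to $N$.

```python
from fractions import Fraction

N = 15  # assumed largest shift (must be at least 1); not taken from the text


def recode(digits, n_max=N):
    """Return exactly two (c, p) digits with c in {-1, +1} and the same value.

    `digits` holds zero, one or two nonzero CSD digits as (c, p) pairs,
    each meaning c * 2**(-p).
    """
    if len(digits) == 2:
        return list(digits)
    if len(digits) == 0:
        return [(1, 0), (-1, 0)]             # Eq. (16): equal shifts cancel
    if len(digits) == 1:
        c, p = digits[0]
        if p < n_max:
            return [(c, p + 1), (c, p + 1)]  # Eq. (17), first form
        return [(c, p - 1), (-c, p)]         # Eq. (17), second form
    raise ValueError("a two-digit p-tap holds at most two digits")


def value(digits):
    """Exact value of a list of (c, p) digits."""
    return sum(Fraction(c, 2 ** p) for c, p in digits)


print(recode([(1, 3)]))     # one digit expanded into two equal digits
print(recode([(-1, N)]))    # digit at the largest shift uses the second form
```

After recoding, every stored digit is +1 or -1, so a single sign bit per digit suffices, as the paragraph above notes.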
3  Circuit  Implementation

The effectiveness of our new programmable architecture depends upon the efficient implementation of the switchable unit-delay, the adder, and the coefficient multiplier. These will be discussed in the following sub-sections.


Figure  4:  Schematic  of  the  switchable  unit-delay  register. 


3.1  Switchable  Unit-Delay

We use a single-phase edge-triggered clocking scheme to simplify on-chip clock distribution and because it has been shown that high speed single-phase clocking can be achieved in CMOS circuitry [15]. Fig. 4 shows our circuit for the switchable unit-delay, which is identical to the true single-phase latch in [15] except for the N-MOS bypass-transistor m12. (This bypass can also be implemented with a full CMOS transmission gate with an additional P-MOS transistor.) With the addition of the single transistor m12, the unit-delay becomes switchable. When “pass” is low, the leading edge of the clock latches the data at “in.” This is the normal unit-delay operation. When “pass” is high and “clock” is low, input data is passed through the input inverter (m1, m2, m3), through m12, and through the output inverter (m10, m11) to the output, thus disabling the unit-delay action. Notice that the clock signal must be disabled when “pass” is enabled. While this requires additional circuitry to disable the clock signal, this scheme has the overwhelming advantage of providing a simple switchable unit-delay circuit having no additional power dissipation due to an actively switching clock signal when the unit-delay register is bypassed. The additional circuitry to disable the clock signal is also insignificant because the clock signal is common to all the unit delays in a p-tap within the same data word. For example, in our prototype chip, a clock line is common to 80 registers.
Figure  5:  A  section  of  p-taps  implemented  with  carry-save  adders.


Figure  6:  Schematic  of  the  transmission  gate  adder. 


3.2  Adder 

Carry-save  additions  are  used  for  the  summation  node  in  each  p-tap  to  avoid  the  carry- 
ripple  delay.  With  a  two-digit  p-tap  coefficient,  two  partial  products  are  produced  by  the 
multiplication  of  the  input  data  and  the  p-tap  coefficient.  Because  carry-save  addition  is 
used, the data sample from the previous p-tap consists of both the sum and carry outputs; the summation node in each p-tap therefore needs to add together four terms. This requires the cascade of two full adders, as shown in Fig. 5.
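The four-term summation can be modeled at the word level as follows. This is a behavioral sketch only: the function names are hypothetical, Python integers stand in for the fixed-length data words, and the order in which the four terms enter the two adders is an assumption rather than a reading of Fig. 5.

```python
def full_adder_word(a, b, c):
    """Row of independent full adders: returns (sum, carry) words.

    No carry ripples between bit positions; the carry word is simply
    weighted by two, so a + b + c == sum + carry.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def p_tap_sum(pp0, pp1, sum_in, carry_in):
    """Add two partial products to the incoming (sum, carry) pair."""
    s1, c1 = full_adder_word(pp0, pp1, sum_in)   # first full adder stage
    return full_adder_word(s1, c1, carry_in)     # second full adder stage


s, c = p_tap_sum(13, 7, 100, 22)
print(s, c, s + c)
```

The pair `(s, c)` is what would be passed to the next p-tap; only a final carry-propagate addition turns it into a single word.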

The  adders  are  implemented  with  CMOS  transmission  gates  as  shown  in  Fig.  6.  Both 
the  carry  and  the  sum  outputs  are  inverted  to  eliminate  output  inverters,  which  reduces 
the  transistor  count  as  well  as  the  adder  delay.  Since  an  adder  with  inverted  outputs  and 
non-inverted  inputs  is  equivalent  to  an  adder  with  inverted  inputs  and  non-inverted  outputs, 
the cascade of two inverted-output adders restores the correct output polarity. However, the signal path that does not pass through both adders requires an additional inverter (INV) as shown in Fig. 5. Since the inverter is not in the critical path of the cascaded adders, it does not degrade the speed performance. Notice that the outputs of the transmission gate adders have reduced logic-high voltage levels due to the threshold voltage drop of the N-transistor pass gates. The voltage level is, however, restored by the programmable unit-delay register before feeding to the next adder stage.

Figure 7: Critical path of a filter tap with two p-taps.

The  adder  has  a  very  fast  signal  path  from  its  C  input  to  both  its  carry  and  sum  outputs. 
The  delay  from  this  “fast”  input  is  only  one  transmission  gate  delay  to  the  sum  output,  and 
a  transmission  gate  delay  plus  an  inverter  delay  to  the  carry  output.  The  presence  of  this 
“fast”  input  is  used  to  improve  the  speed  of  the  cascaded  adders  as  follows.  When  two  or 
more  p-taps  are  merged  together  to  form  a  filter  tap,  the  adders  are  connected  in  series. 
However,  since  the  partial  products  are  computed  simultaneously,  the  delay  of  the  adder 
chain  can  be  reduced  by  designing  the  full  adder  such  that  it  has  a  fast  path  from  one  of  its 
inputs  to  both  its  sum  and  carry  outputs.  By  feeding  the  two  partial  products  to  the  two 
“slower”  inputs,  the  critical  path  delay  for  each  pair  of  cascaded  full  adders  is  only  a  normal 
full  adder  delay  plus  the  fast  adder  path.  Therefore,  the  critical  path  for  a  filter  tap  is 

$$t_{\text{total}} = t_{\text{adder}} + (K-1)\,t_{\text{adder,fast}} + (K-1)\,t_{\text{unit-delay(off)}} + t_{\text{unit-delay(on)}} \qquad (18)$$

where $t_{\text{adder}}$ is the full delay through a pair of cascaded full adders, $t_{\text{adder,fast}}$ is the delay through a pair of cascaded full adders with a fast path, and $K$ is the number of p-taps that are merged into a filter tap. This is illustrated in Fig. 7.
Figure  8:  Schematic  of  the  two-level  NMOS  multiplexor.


3.3  Coefficient  Multiplier 

Each coefficient digit $c_k 2^{-p_k}$ consists of two factors: one is the $2^{-p_k}$ weighting factor, which can be implemented by a right shift, and the other is the $c_k$ bit multiplication factor.

The $2^{-p_k}$ shifting is realized by selecting one of 16 (the word length of the input data) hardwired preshifted data via two levels of 4-to-1 NMOS transmission gate multiplexors (Fig. 8). The advantage of the two-level multiplexing is the reduction in the number of control lines to eight. To save silicon area, each block of hardwired preshift is shared by four sets of multiplexors (or two p-taps, since each p-tap has two coefficient digits).

Since $c_k$ is either 1 or -1, and never 0, multiplication for each digit is easily handled by an invert/no-invert circuit realized by a simple exclusive-OR gate. This forms the 1’s complement of the shifted data for the case of a negative coefficient digit. The LSB of 1 that needs to be added to form the 2’s complement negation is accumulated into a sum for all the coefficient digit multipliers. This sum forms part of the compensation vector that is added to the first p-tap in the forward datapath (which has free adder inputs) [7].
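The shift-and-invert multiplication and the deferred LSB correction can be illustrated with a behavioral sketch. The function names are hypothetical, Python's arithmetic right shift stands in for the hardwired preshift (which truncates), and only the LSB part of the compensation vector is modeled, not the sign-extension part.

```python
def digit_product(x, c, p):
    """Apply one coefficient digit c * 2**(-p) to the integer sample x.

    Returns (word, lsb): the shifted sample, bit-inverted when c == -1,
    and the +1 still owed to complete the 2's complement negation.
    """
    if c not in (-1, 1):
        raise ValueError("recoded digits are +1 or -1 only")
    shifted = x >> p           # preshifted copy selected by the multiplexors
    if c == 1:
        return shifted, 0
    return ~shifted, 1         # XOR with all ones gives the 1's complement


def p_tap_products(x, digits):
    """Partial products of one p-tap plus its share of the compensation."""
    words, compensation = [], 0
    for c, p in digits:
        word, lsb = digit_product(x, c, p)
        words.append(word)
        compensation += lsb
    return words, compensation


words, comp = p_tap_products(1000, [(1, 2), (-1, 5)])
print(words, comp, sum(words) + comp)
```

Adding `comp` once, at a point where an adder input is free, gives the same result as negating each shifted word individually.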

Due  to  the  considerable  delay  incurred  by  the  long  input  data  bus  and  the  two-level 
transmission  gate  multiplexor,  a  pipeline  register  (shown  as  Rp  in  Fig.  2)  is  inserted  after 
the  coefficient  multiplier  in  order  to  obtain  a  higher  maximum  data-rate. 
4  Prototype  Chip

Figure  9:  Block  diagram  of  the  programmable  FIR  chip.

Fig.  9  shows  the  block  diagram  of  our  programmable  linear-phase  FIR  filter  chip.  It  has 
16-bit  input  and  output  data.  Its  internal  word  length  is  chosen  to  be  20-bit  to  ensure  that 
the error at the filter’s output due to internal quantization is less than the quantization error
due  to  the  finite  word  length  of  the  data.  The  core  of  the  chip  is  the  series  of  32  p-taps, 
folded  to  share  the  symmetrical  coefficients  for  linear-phase  operation.  Surrounding  the  core 
are  the  clock  and  data  drivers,  the  vector  merge  adder  (VMA),  the  compensation  vector 
register  (CVR),  the  programmable  inverters  (PINV),  the  coefficient  registers,  and  testing 
circuitry. 

The  carry  and  sum  outputs  from  the  last  p-tap  are  added  using  a  20-bit  VMA  to  produce 
the  final  output.  The  VMA  is  implemented  by  a  five  stage  pipelined  carry-ripple  adder.  The 
pipelining  removes  the  VMA  from  the  filter’s  critical  path. 

The  programmable  compensation  vector  register  (CVR)  is  used  to  correct  the  filter  core 
output by adding in the MSB sign-extension and the additional 1’s needed for 2’s complement
negation.  It  can  also  be  used  to  select  between  rounding  or  truncation.  The  compensation 
vector  is  programmed  through  the  input  data  bus  because  of  the  limited  number  of  pins 
(84)  available  on  our  small  die.  A  programmable  inverter  (PINV)  is  inserted  in  the  middle 
of  the  series  of  p-taps  to  permit  the  chip  to  implement  filters  with  either  symmetrical  or 
anti-symmetrical  impulse  responses. 

Table 1: Summary of the prototype programmable FIR chip.

    Maximum FIR order          32
    Technology                 1.2-µm CMOS
    I/O word length            16-bit
    Coefficient word length    16-bit
    Internal word length       20-bit
    Core area                  4.2 × 2.8 mm²
    Die size (with pads)       5.9 × 3.4 mm²
    Maximum data-rate          180 MHz
    Power supply               5 V
    Power consumption          1.3 W @ 180 MHz
    Packaging                  84-pin PGA

To facilitate the testing of the chip, a 16-bit pseudo random number generator (PRNG) and an output decimator (DEC) are implemented on-chip. The PRNG is based on the type 2 linear feedback shift registers [16, pages 432-441], which will produce a pseudo random number sequence provided that the states of the linear feedback shift registers are not identically zero (which will produce a sequence of constant zeros). To avoid the zero state, the starting state of the PRNG is made to be programmable through the input data bus. The output decimator, when not bypassed, decimates the output samples by a factor of 16. In our test setup, when testing is performed within the frequency range of our tester (< 50 MHz), the output decimator is bypassed and input test vectors are applied by the tester. To perform testing beyond the frequency range of the tester, the clock signal to the chip is supplied by an external high frequency source, the PRNG is turned on, and the output is decimated and sampled asynchronously by the tester. A computer program is used to correlate the outputs sampled by the tester with the calculated result to verify the chip’s functionality at the higher speed. This permits us to verify the core of the chip to at least 8 times the sampling speed of our tester.
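The zero-state lock-up mentioned above is easy to see in a software model of a linear feedback shift register. The sketch below assumes that “type 2” denotes the internal-XOR (Galois) form, and it uses the widely published maximal-length mask 0xB400 (polynomial $x^{16} + x^{14} + x^{13} + x^{11} + 1$); the feedback polynomial actually used on the chip is not given in the text.

```python
def lfsr16_step(state, taps=0xB400):
    """Advance a 16-bit internal-XOR (Galois) LFSR by one clock."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps   # feedback is XORed into the tapped stages
    return state


def period(seed):
    """Number of clocks until the register returns to its seed."""
    state = lfsr16_step(seed)
    count = 1
    while state != seed:
        state = lfsr16_step(state)
        count += 1
    return count


print(period(0x0000))   # the all-zero state never leaves zero
print(period(0xACE1))   # any nonzero seed walks the full cycle
```

Because every nonzero seed lies on the same maximal cycle in this model, making the start state programmable is enough to avoid the constant-zero sequence.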

The  chip  was  designed  using  the  Mentor  Graphics  GDT  VLSI  CAD  tools.  The  leaf  cells 
for  the  chip  are  all  custom  layouts,  so  as  to  obtain  the  best  performance.  The  leaf  cells  are 
assembled  by  a  compiler  with  parameterized  word  length  and  number  of  p-taps.  Thus,  any 
size  filter  chip  can  be  generated  very  easily.  The  compiler  is  written  in  the  Genie  language,  a 
C-like  interpreted  language  with  interface  to  access  the  GDT  layout  database.  A  summary  of 
the  prototype  chip  is  given  in  Table  1.  The  prototype  chip  (Fig.  10)  was  fabricated  through 
the MOSIS service using the Hewlett-Packard 1.2-µm CMOS N-well process.

The  prototype  chip  has  been  tested  to  operate  up  to  a  data-rate  of  180  MHz,  for  filter 
taps  consisting  of  single  p-taps  (i.e.,  at  most  two  nonzero  CSD  digits  per  filter  tap).  For  the 
case  of  two  p-taps  merged  to  form  a  single  filter  tap  (i.e.,  at  most  four  nonzero  CSD  digits 
per  filter  tap),  the  chip  will  operate  up  to  a  data-rate  of  90  MHz.  However,  the  input  data 
can  also  be  applied  to  two  programmable  FIR  filters,  each  having  half  the  number  of  filter 
coefficient  digits  per  filter  tap.  The  outputs  of  these  filters  can  then  be  added  together  by 
an additional adder. When configured this way, the maximum 180 MHz data-rate can be achieved for filters whose taps would require up to four non-zero CSD digits. This concept can be extended to include more parallel programmable FIR filters for operations at the maximum data-rate while having filter taps with more than four non-zero CSD digits.

Figure 10: Photograph of the prototype chip.
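The parallel configuration relies only on the linearity of FIR filtering, which the following sketch checks numerically. The tap digits and input samples are hypothetical, and the function names are introduced here for illustration.

```python
from fractions import Fraction


def tap_value(digits):
    """Value of one tap given its (c, p) digits, each meaning c * 2**(-p)."""
    return sum(Fraction(c, 2 ** p) for c, p in digits)


def split_filter(taps):
    """Split taps of up to four digits into two filters of up to two digits."""
    first, second = [], []
    for digits in taps:
        if len(digits) > 4:
            raise ValueError("this sketch handles at most four digits per tap")
        first.append(tap_value(digits[:2]))
        second.append(tap_value(digits[2:]))
    return first, second


def fir(h, x):
    """Direct convolution, truncated to the length of x."""
    return [sum(h[k] * x[n - k] for k in range(min(n + 1, len(h))))
            for n in range(len(x))]


# Hypothetical taps: two with four digits, one with two digits.
taps = [[(1, 1), (-1, 3), (1, 6), (1, 8)],
        [(1, 2), (-1, 4), (-1, 6), (1, 9)],
        [(-1, 5), (1, 7)]]
x = [64, -32, 16, 8, 0, -128]
h = [tap_value(d) for d in taps]
h_a, h_b = split_filter(taps)
direct = fir(h, x)
summed = [a + b for a, b in zip(fir(h_a, x), fir(h_b, x))]
print(direct == summed)
```

Each of the two partial filters needs at most two digits per tap, so each can run with single p-taps; the price is a second filter and one extra adder.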
5  Design  Examples

Three  example  filters  have  been  designed  to  show  the  versatility  of  the  proposed  architecture: 
a  32-tap  lowpass  filter,  a  16-tap  lowpass  filter,  and  a  32-tap  bandpass  filter.  All  three 
filter  designs  were  constrained  such  that  they  could  be  implemented  on  the  prototype  chip 
described  in  Section  4  (i.e.,  at  most  32  taps  with  two  nonzero  CSD  digits  per  tap  and  a 
16-bit  shift  range).  With  a  larger  filter  core  (i.e.,  more  p-taps)  more  demanding  filters  could 
be  implemented. 


$h(0) = +2^{-11} + 2^{-14}$
$h(1) = -2^{-11}$
$h(2) = -2^{-9} - 2^{-11}$
$h(3) = -2^{-10} + 2^{-14}$
$h(4) = +2^{-8} + 2^{-11}$
$h(5) = +2^{-7} - 2^{-9}$
$h(6) = -2^{-8} - 2^{-10}$
$h(7) = -2^{-6} + 2^{-9}$
$h(8) = -2^{-3} + 2^{-11}$
$h(9) = +2^{-5} - 2^{-7}$
$h(10) = +2^{-6} + 2^{-8}$
$h(11) = -2^{-5} + 2^{-7}$
$h(12) = -2^{-4} + 2^{-7}$
$h(13) = -2^{-9} - 2^{-12}$
$h(14) = +2^{-3} + 2^{-6}$
$h(15) = +2^{-2} + 2^{-6}$

Figure 11: Frequency response and CSD coefficients for 32-tap FIR lowpass filter of Example 1.

Example  1:  This  example  filter  is  a  32-tap  lowpass  filter  with  normalized  passband  and 
stopband  edge  frequencies  of  0.15  and  0.25,  respectively.  The  filter  achieves  a  normalized 
stopband  attenuation  of  41.5  dB  with  a  peak-to-peak  passband  ripple  of  0.074  dB  using 
only  two  nonzero  CSD  digits  per  tap.  The  coefficients  for  this  filter  and  the  corresponding 
frequency  response  are  shown  in  Fig.  11.  This  filter  requires  a  total  of  32  p-taps  when 
implemented  on  our  prototype  chip.  Since  this  filter  requires  at  most  two  coefficient  digits 
per  filter  tap,  the  maximum  data-rate  achievable  when  implemented  on  our  prototype  chip 
is  180  MHz. 
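A frequency response such as the one in Fig. 11 can be evaluated directly from a CSD coefficient list. The sketch below is a generic illustration using only the Python standard library; the function names are hypothetical, and it assumes the listed coefficients are the first half of a symmetric (linear-phase) impulse response. A small made-up coefficient set is used rather than the values of the figure.

```python
import cmath
import math


def csd_value(digits):
    """Value of a tap from its (c, p) digits, each meaning c * 2**(-p)."""
    return sum(c * 2.0 ** -p for c, p in digits)


def linear_phase_taps(half):
    """Mirror the first half of the coefficients to get a symmetric filter."""
    values = [csd_value(d) for d in half]
    return values + values[::-1]


def magnitude_db(h, f):
    """Magnitude response in dB at frequency f (cycles/sample)."""
    response = sum(hn * cmath.exp(-2j * math.pi * f * n)
                   for n, hn in enumerate(h))
    mag = abs(response)
    return 20 * math.log10(mag) if mag > 0 else float("-inf")


# Made-up example: a 4-tap moving average written with power-of-two digits.
half = [[(1, 3), (1, 3)], [(1, 2)]]
h = linear_phase_taps(half)
for f in (0.0, 0.1, 0.25):
    print(f, magnitude_db(h, f))
```

Sweeping `f` from 0 to 0.5 and normalizing by the passband gain would give a curve comparable to the plotted responses.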

Example 2: This example filter is a 16-tap lowpass filter with normalized passband and stopband edge frequencies of 0.125 and 0.35, respectively. The filter achieves a stopband attenuation of 77.3 dB with a peak-to-peak passband ripple of 0.1 dB using three or four nonzero CSD digits per filter tap. The filter coefficients and the corresponding frequency response are shown in Fig. 12. When implemented on our prototype chip, this filter requires a total of 32 p-taps and therefore fully utilizes the hardware. The largest tap for this filter requires four nonzero digits (i.e., two p-taps) and therefore the maximum data-rate achievable by the prototype chip, for this filter, is 90 MHz.

$h(0) = 2^{-11} - 2^{-14} + 2^{-16}$
$h(1) = 2^{-7} - 2^{-9} + 2^{-11} - 2^{-13}$
$h(2) = 2^{-6} - 2^{-9} - 2^{-11} + 2^{-13}$
$h(3) = -2^{-7} - 2^{-11} + 2^{-13} + 2^{-15}$
$h(4) = -2^{-4} + 2^{-7} - 2^{-10} - 2^{-15}$
$h(5) = -2^{-5} + 2^{-7} + 2^{-13} + 2^{-16}$
$h(6) = 2^{-2} - 2^{-4} - 2^{-6} + 2^{-9}$
$h(7) = 2^{-1} - 2^{-3} + 2^{-6} + 2^{-8}$

Figure 12: Frequency response and CSD coefficients for 16-tap FIR lowpass filter of Example 2.

Example  3:  This  example  filter  is  a  32-tap  bandpass  filter  with  the  first  stopband  edge 
frequency  of  0.1  (normalized),  passband  edge  frequencies  of  0.2  and  0.3,  and  the  second 
stopband  edge  frequency  of  0.4.  The  filter  achieves  normalized  attenuation  levels  of  47.6  dB 
and  49.9  dB  in  the  first  and  second  stopbands,  respectively,  and  a  peak-to-peak  passband 
ripple  of  0.04  dB,  while  using  only  two  nonzero  CSD  digits  per  filter  tap.  The  coefficients 
for this filter and the corresponding frequency response are shown in Fig. 13. This filter requires a total of 32 p-taps when implemented on our prototype chip. As in Example 1, since
the  largest  number  of  coefficient  digits  per  filter  tap  is  only  two,  the  maximum  data-rate 
achievable  by  our  prototype  chip,  for  this  filter,  is  180  MHz. 

These  three  examples  demonstrate  the  efficiency  of  our  architecture.  By  contrast,  a 
straightforward  programmable  FIR  filter  chip  capable  of  implementing  all  three  of  these 
example  filters  with  a  uniform  filter  tap  structure  would  require  32  taps  with  each  tap 
having  four  nonzero  digits.  Thus,  hardware  for  a  total  of  128  nonzero  CSD  digits  would  be 
required.  In  our  prototype  chip,  however,  we  are  able  to  implement  all  three  filters  using 
only  32  2-digit  taps  (p-taps),  or  a  total  of  64  nonzero  digits — a  savings  of  50%. 
$h(0) = -2^{-10} + 2^{-12}$
$h(1) = -2^{-11}$
$h(2) = -2^{-9} - 2^{-11}$
$h(3) = +2^{-8} + 2^{-10}$
$h(4) = +2^{-7} - 2^{-11}$
$h(5) = -2^{-8} - 2^{-13}$
$h(6) = +2^{-8} - 2^{-10}$
$h(7) = -2^{-6} - 2^{-15}$
$h(8) = -2^{-5} + 2^{-8}$
$h(9) = +2^{-5} - 2^{-11}$
$h(10) = +2^{-6} + 2^{-9}$
$h(11) = +2^{-6} + 2^{-9}$
$h(12) = +2^{-4} + 2^{-7}$
$h(13) = -2^{-3} - 2^{-7}$
$h(14) = -2^{-2} + 2^{-4}$
$h(15) = +2^{-2} - 2^{-5}$

Figure 13: Frequency response and CSD coefficients for the 32-tap FIR bandpass filter of Example 3.

6  Conclusion 

We  have  presented  a  new  architecture  for  the  implementation  of  the  transposed  FIR  digital 
filter.  We  use  a  novel  switchable  unit-delay  to  allocate  the  optimal  hardware  resources  to 
each  filter  tap.  Moreover,  a  simple  recoding  of  the  coefficient  values  results  in  a  simplification 
of  the  digit  multiplication  hardware.  A  prototype  chip  that  can  realize  FIR  filters  with  up 
to  32  linear-phase  taps  with  16-bit  I/O  has  been  implemented  within  a  die  size  of  5.9  mm  by 
3.4 mm using 1.2-µm CMOS technology. The chip has been fabricated and tested to operate
at  data-rates  up  to  180  MHz. 

While  our  new  programmable  structure  is  capable  of  implementing  filters  designed  using 
existing  algorithms  for  designing  filters  with  Powers-of-Two  coefficients,  it  will  benefit  from 
more  specialized  algorithms  that  can  exploit  our  unique  programmable-tap  structure.  That 
is,  by  taking  advantage  of  our  ability  to  use  a  small  number  of  nonzero  digits  for  many  taps, 
we  can  expect  to  design  significantly  longer  FIR  filters  than  could  be  implemented  with 
presently available CSD FIR approaches. A promising algorithm has been reported in [11],
where  the  number  of  digits  at  each  tap  is  variable  and  the  optimization  algorithm  seeks  to 
minimize  the  total  number  of  coefficient  digits  for  the  entire  filter.  For  the  purpose  of  our 
filter,  the  optimization  algorithm  should  minimize  the  pairs  of  coefficient  digits  at  each  tap 
while  using  as  many  filter  taps  as  possible,  subject  to  the  available  resources  on  a  given  filter 
chip. 
Automated  Programming  of  Digital  Filters
for  Parallel  Processing  Implementation

Michael J. Werter, Member, IEEE, and Alan N. Willson, Jr., Fellow, IEEE


Abstract: A computer algorithm is described that automatically writes optimal programs for the implementation of arbitrary digital filter structures on parallel processors. The algorithm has been adapted particularly for programming a DSP chip with multiple processors arranged in a ring-type topology. The algorithm starts from a netlist describing a desired digital filter structure. The algorithm’s output is a set of programs for the parallel processors which causes them to implement the given digital filter.


I.  Introduction 

THIS PAPER DESCRIBES a computer algorithm that
automatically  writes  programs  for  the  implementation 
of  digital  filters  on  parallel  processors.  It  has  been  used 
for  implementing  many  common  filter  structures  on  a  new 
digital  signal  processing  (DSP)  chip  [1],  [2].  The  programs 
are  optimal;  that  is,  they  use  the  minimum  number  of  program 
steps  per  data  sample  to  implement  a  given  arbitrary  digital 
filter structure. The algorithm's “input” is a netlist describing
the  desired  digital  filter  structure,  which  is  used  to  define 
a  shift-invariant  data-flow  graph:  a  directed  graph  in  which 
all  operations  (additions,  multiplications,  and  time  delays)  are 
specified  at  the  nodes,  and  in  which  the  branches  are  directed 
paths specifying the flow of data between nodes [3]-[5]. The
algorithm  first  optimizes  this  flow  graph  to  achieve  the  best 
performance  from  the  parallel  processors  when  implementing 
the  given  filter  structure.  It  next  calculates  a  time  schedule  for 
the  flow  graph’s  arithmetical  operations  and  then  distributes 
these  operations  over  the  multiple  processors,  taking  into 
account  all  the  restrictions  which  appear  due  to  the  topology 
and  the  processors'  architecture.  The  algorithm’s  “output”  is 
a  set  of  programs  for  the  parallel  processors  which  causes  them 
to  implement  the  given  digital  filter. 

For the reader’s convenience, some properties of the programmable digital filter IC presented in [1] are briefly reviewed in Section II. In Section III the computer algorithm is described. In Section IV we compare our algorithm with other scheduling algorithms, and we summarize the main results in Section V.
Fig.  1.  Ring-structured  processor  topology.

II.  A  Ring-Structured  Topology  for  Digital  Filtering

In [1] a DSP chip is described that contains multiple processors placed in a ring-structured topology on a single integrated
circuit  (Fig.  1).  Due  to  this  ring  structure  the  communication 
between  processors  is  restricted  to  neighboring  processors 
only.  For  the  implementation  of  many  popular  digital  filter 
structures  this  restriction  produces  no  disadvantage  over  more 
complex  communication  schemes;  it  has  been  shown  in  [1]  that 
the  ring-structured  parallel  processor  system  can  implement 
filters  using  the  minimum  possible  number  of  sequential 
arithmetic  operations  per  data  sample. 

Since  the  intended  application  of  the  DSP  chip  is  real-time 
digital  filtering,  the  processors  need  only  be  able  to  perform 
the  five  instructions:  add,  subtract,  multiply,  move  (register 
to  register),  and  nop  (no  operation).  The  ALU  consists  of  a 
hardware  multiplier,  a  RAM  to  store  multiplier  coefficients  and 
an  adder/subtractor,  as  shown  in  Fig.  2.  The  ALU  is  pipelined 
so that the multiplier will execute in one clock cycle. This way,
it  can  perform  an  addition  and  a  multiplication  simultaneously. 
Since  it  happens  that  most  digital  filters  perform  an  addition 
immediately  following  a  multiplication,  this  ALU  architecture 
makes  it  possible  to  perform  both  functions  in  “essentially” 
one  instruction  step. 

III.  Computer  Algorithm 

In  a  general-purpose  computing  context  the  major  difficulty 
with  most  parallel  architectures  is  specifying  how  to  program 
them.  However,  since  digital  filters  require  no  conditional 
branching  it  is  possible  to  write  a  computer  algorithm  (a  task 
partitioner)  that  analyzes  a  structural  description  of  a  given 
filter and writes optimal programs for parallel processing. We have developed such an algorithm. Its flow graph is shown in Fig. 3. In this section, the algorithm will be explained with the aid of an example.

Fig. 2. ALU architecture.

Fig. 3. Flow graph of computer algorithm.

Section III will show that the filter programs written by the algorithm are optimal; that is, they use the minimum number of program steps per data sample to implement a given arbitrary digital filter structure. The algorithm calculates the optimum sampling period $T_0$ (Section III-B), it searches for an optimum schedule (Section III-C) and optimum distribution of the operations over the processors (Section III-D). If (and only if) it is not possible to implement the given digital filter at the optimum sampling period, the algorithm increases the sampling period by one time unit (Section III-F).


Fig. 4. Second-order direct form II filter. (a) Filter structure. (b) Shift-invariant data-flow graph.


A.  Input 

The  computer  algorithm  starts  with  a  netlist  describing  a 
desired  digital  filter  structure.  That  is,  the  algorithm’s  input 
data  specify  a  shift-invariant  data-flow  graph  by  stating  how 
each  node  k  (representing  an  addition,  multiplication  or  time 
delay)  is  connected  with  other  nodes  of  the  flow  graph.  To 
avoid  ambiguities,  we  assume  that  each  adder  has  two  ingoing 
branches  and  one  outgoing  branch.  We  also  assume  that  any 
desired filter is “realizable” [6], [7], i.e., that it contains no
delay-free directed loop. In addition, the filter is assumed to
be “proper” [8] in the sense that there is a directed path from
the  input  to  every  node  and  a  directed  path  to  the  output  from 
every  node  in  the  data-flow  graph.  In  other  words,  all  parts  of 
a  proper  filter  affect  the  input/output  behavior. 

As an example, Fig. 4(a) shows the topology of a second-order direct form II digital filter and the corresponding data-flow graph is shown in Fig. 4(b). The netlist input file of this example is shown in Appendix C.

Our algorithm first combines each multiplication with a subsequent
addition into a two-step Multiply-Accumulate (MAC)
instruction. If no addition follows a multiplier, “zero” will
be added to it during the accumulate stage.¹ The MAC
instructions are called supernodes in the data-flow graph and
they are depicted by dashed ellipses in Fig. 4(b).

¹ If a time-delay element is located between a multiplier and an adder, then
the sequence of the time-delay and the multiplier will be reversed, so that the
multiplier can be integrated with the adder in a MAC instruction.


B. Optimum sampling period $T_0$

Using a well-known technique of Renfors and Neuvo [8],
[9], our algorithm calculates the theoretical minimum sampling
period² $T_{\min}$ that would be possible for any custom parallel
implementation of the specified filter structure, assuming that
an unlimited number of processors is available. To this end, it
searches for all directed loops in the data-flow graph. For every
directed loop $l$, it counts the number of time-delay nodes $N_l$
and it calculates the arithmetic loop delay $D_l$, which equals the
total processing time consumed by the arithmetical operations
in the loop. The minimum sampling period $T_{\min}$ is calculated
by

$$T_{\min} = \max_{l} \left( D_l / N_l \right) \qquad (1)$$

where the maximum is taken over all directed loops in the flow
graph. A directed loop in which this maximum is reached is
called a critical loop.
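
As an illustration of (1), the following Python sketch enumerates the directed loops of the Fig. 4 example and evaluates $D_l / N_l$ for each. It is not the authors' program: the dictionary `PREDECESSORS` is a hypothetical encoding of the Appendix C netlist (coefficient inputs omitted), the function names are invented for this example, and one step per multiplication and per addition is assumed.

```python
from fractions import Fraction

# Hypothetical encoding of the Fig. 4 filter: each node maps to the nodes
# that feed it (coefficient inputs omitted). X is the input, Y the output.
PREDECESSORS = {
    "M1": ["T1"], "M2": ["T2"], "M3": ["A1"], "M4": ["T1"], "M5": ["T2"],
    "A1": ["M1", "A2"], "A2": ["M2", "X"], "A3": ["M3", "A4"],
    "A4": ["M4", "M5"], "T1": ["A1"], "T2": ["T1"], "Y": ["A3"], "X": [],
}

def processing_time(node, t_mult=1, t_add=1):
    """Arithmetic time of a node; delays, input and output cost nothing here."""
    return {"M": t_mult, "A": t_add}.get(node[0], 0)

def directed_loops(preds):
    """List every simple directed loop once, starting at its smallest node."""
    succs = {node: [] for node in preds}
    for node, sources in preds.items():
        for src in sources:
            succs[src].append(node)
    loops = []

    def walk(start, node, path):
        for nxt in succs[node]:
            if nxt == start:
                loops.append(path)
            elif nxt > start and nxt not in path:
                walk(start, nxt, path + [nxt])

    for start in sorted(succs):
        walk(start, start, [start])
    return loops

def minimum_sampling_period(preds):
    """Eq. (1): maximum of D_l / N_l over all directed loops l."""
    best = Fraction(0)
    for loop in directed_loops(preds):
        delay = sum(processing_time(n) for n in loop)     # D_l
        registers = sum(n.startswith("T") for n in loop)  # N_l
        if registers == 0:
            raise ValueError("delay-free loop: filter is not realizable")
        best = max(best, Fraction(delay, registers))
    return best
```

Under these assumptions the sketch finds two loops for the example and a minimum sampling period of two steps, set by the loop through M1, A1 and T1.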

An alternative way to compute the minimum sampling
period (iteration period bound) is based on the longest-path
matrices and their multiplication [10]. An advantage of that
algorithm is that it has a polynomial complexity, while the
search for all possible directed loops in a data-flow graph
can grow as a factorial function of the number of time-delay
nodes, as discussed in Appendix A. The program can easily
be modified to support this alternative approach.

The second-order filter example of Fig. 4 has two recursive
loops. The minimum sampling period $T_{\min}$, calculated with
the Renfors and Neuvo algorithm [8], is found from the loop
containing M1, A1 and T1:

$$T_{\min} = T_M + T_A$$

where $T_M$ and $T_A$ denote the time needed for a multiplication
and an addition, respectively.

Since the implementation of the desired digital filter structure
must be accomplished on a limited number of processors
P, our algorithm next calculates $T_P$, the minimum total
computation time per processor. The average computation time
of a MAC instruction³ equals $T_M$. The computation time of an
add, subtract or move instruction equals $T_A$, so the minimum
total computation time per processor $T_P$ for an implementation
of the desired digital filter structure on a limited number of
processors P equals

$$T_P = \left\lceil \frac{K_1 \cdot T_M + K_2 \cdot T_A}{P} \right\rceil \qquad (2)$$

where $K_1$ is the total number of supernodes and $K_2$ is the
total number of other nodes in the data-flow graph which are
not time-delay nodes.

The optimum sampling period for the implementation of the
flow graph on a multiple processor system with P processors
can now be calculated by

$$T_0 = \max(T_{\min}, T_P). \qquad (3)$$

² The minimum sampling period $T_{\min}$ has been called the iteration period
bound in [3], [10], and its reciprocal value is, of course, the maximum sampling
rate [8].

³ Our algorithm can handle pipelined multipliers in which a multiplication
would be executed in multiple stages.


Fig. 5. Three topologies of second-order direct form II filter.


From (2) and (3) we can calculate $P_{\min}$, the minimum number
of processors that is needed to implement a digital filter at the
theoretical minimum sampling period $T_{\min}$:

$$P_{\min} = \left\lceil \frac{K_1 \cdot T_M + K_2 \cdot T_A}{T_{\min}} \right\rceil \qquad (4)$$

From (4) we conclude that the minimum number of processors
needed to execute the five MAC instructions of the
second-order filter example of Fig. 4 on a system with $T_A$ =
$T_M$ = one step at the theoretical minimum sampling period
$T_{\min}$ = two steps is $P_{\min}$ = 3 processors.
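
The arithmetic behind (2), (3) and (4) can be checked with a few lines of Python. This is only a sketch: the function names are invented here, all times are assumed to be whole program steps, and the quotients are assumed to be rounded up to the next whole step or processor, which is what the result $P_{\min}$ = 3 for the example requires.

```python
def ceil_div(numerator, denominator):
    """Integer division rounded up (assumes positive integers)."""
    return -(-numerator // denominator)

def processor_time_bound(k1, k2, t_mac, t_add, processors):
    """Eq. (2): minimum total computation time per processor, in steps."""
    return ceil_div(k1 * t_mac + k2 * t_add, processors)

def optimum_sampling_period(t_min, k1, k2, t_mac, t_add, processors):
    """Eq. (3): the larger of the loop bound and the processor bound."""
    return max(t_min, processor_time_bound(k1, k2, t_mac, t_add, processors))

def minimum_processors(t_min, k1, k2, t_mac, t_add):
    """Eq. (4): processors needed to run at the loop bound t_min."""
    return ceil_div(k1 * t_mac + k2 * t_add, t_min)

# Fig. 4 example: five supernodes (MACs), no other arithmetic nodes,
# T_M = T_A = 1 step, T_min = 2 steps.
p_min = minimum_processors(t_min=2, k1=5, k2=0, t_mac=1, t_add=1)
t0_three = optimum_sampling_period(2, 5, 0, 1, 1, processors=3)
t0_two = optimum_sampling_period(2, 5, 0, 1, 1, processors=2)
```

With three processors the example runs at $T_0$ = 2 steps; with only two processors the processor bound dominates and the sketch gives $T_0$ = 3 steps.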

In Fig. 4(a) we see four two-input adders. Adders A1 and
A2 can, however, be considered as one three-input adder with
ingoing branches from input x and multipliers M1 and M2.
This three-input adder could be implemented in three different
ways, as shown in the three Fig. 5 structures. As is well
known, all three implementations produce the same three-input
sum if two’s complement arithmetic is used for quantization
and overflow correction. Of the three filters shown, only the
first has the minimum sampling period $T_{\min}$ = two steps; the
other two have loops with two adders, one multiplier and
one time delay, so that $T_{\min}$ = three steps. This shows that
it is important to optimally sequence the ingoing branches of a
multiple-input adder. For this reason our algorithm detects
all multiple-input adders in the desired filter structure and
provides the user the option of searching for the optimum
adder sequence to minimize the filter’s sampling period. If this
option is used, the algorithm recursively splits each $N$-input
adder with $N > 2$ into an $M$-input adder and an $(N - M)$-input
adder, where $1 \le M \le N/2$. (A one-input adder is
simply a branch in the data-flow graph.) There are $\binom{N}{M}$
different combinations into which the $N$-input adder can be
split, yielding an $M$-input adder and an $(N - M)$-input adder,
where

$$\binom{N}{M} = \frac{N!}{M!\,(N-M)!}.$$

Let $S(N)$ be the total number
of different combinations into which an $N$-input adder can be
split; then $S(1) = 1$, and

$$S(N) = \frac{1}{2} \sum_{M=1}^{N-1} \binom{N}{M} S(M)\,S(N-M) = (2N-3)!! \quad \text{for } N \ge 2$$

where $i!! = i(i-2)(i-4) \cdots 5 \cdot 3 \cdot 1$ for $i$ odd. A proof
of this result is given in Appendix B. The total number of
combinations increases rapidly with $N$; therefore it is not
practical to check all combinations for adders having many
ingoing branches. In most filter structures, however, multiple-input
adders with many ingoing branches are rare (perhaps
the most noteworthy exception being the direct-form FIR
structure). Thus it is, in fact, usually feasible to employ
our algorithm’s option to search for the filter topology with
optimally-sequenced adders that yields the lowest minimum
sampling period $T_0$.
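
To give a feel for how quickly the number of adder splittings grows, the recursion for $S(N)$ can be evaluated directly and compared with the closed form $(2N-3)!!$. The sketch below is only a numerical check, not a substitute for the proof in Appendix B; the function names are chosen for this example.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def split_count(n):
    """S(N) from the recursion: S(1) = 1, S(N) = 1/2 * sum C(N,M) S(M) S(N-M)."""
    if n == 1:
        return 1
    total = sum(comb(n, m) * split_count(m) * split_count(n - m)
                for m in range(1, n))
    return total // 2  # the sum counts every split twice, so it is even

def double_factorial(i):
    """i!! = i (i-2) (i-4) ... 3 * 1 for odd i (returns 1 for i <= 1)."""
    result = 1
    while i > 1:
        result *= i
        i -= 2
    return result

# Number of ways to sequence an N-input adder, N = 2 .. 8.
counts = {n: split_count(n) for n in range(2, 9)}
```

A three-input adder has 3 splittings (the three structures of Fig. 5), a four-input adder has 15, and an eight-input adder already has 135,135, which is why an exhaustive search is only practical for adders with few ingoing branches.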


C.  Time  schedule 

Having found $T_0$, the flow graph of Fig. 3 shows that the
next task is to determine the time schedule. The earliest time
$T(k)$ at which the operation at node k can be started is found
from a maximal distance spanning tree [11], which is a tree of
the data-flow graph containing all of the flow graph’s nodes,
having the property that there is a directed path from the
input to each node k, such that the sum of all processing
times in such a path is maximal. The individual nodes in
a supernode of the data-flow graph cannot be separated in
the maximal distance spanning tree since they represent the
addition and multiplication of a single MAC instruction.
Therefore, a branch within a supernode is always a part of
the maximal distance spanning tree. As discussed in [8], the
processing time of a time-delay node equals $-T_0$ (a negative
value!), which causes the total processing time of a critical
loop to be zero, while the total processing time of every directed
noncritical loop is negative. The latter implies that
there is some (positive) slack time between the time that the
execution of an operation is completed and the time that the
result of this operation is needed for further processing. These
slack times have been called “shimming delays” [7], and they
can be depicted as (positive) shimming-delay blocks in some
of the branches of the data-flow graph. After insertion of all
shimming-delay elements into the data-flow graph the total
processing time of each loop (directed or nondirected) equals
zero, where in a nondirected loop the sign of the processing
time of a node operation or a shimming delay is reversed if
its direction is opposite to the loop’s reference direction in the
data-flow graph. A maximal distance spanning tree is found
using an algorithm similar to the Bellman-Ford method [12].


Fig. 6. Second-order direct form II filter. (a) Original data-flow graph. (b)
Data-flow graph after deletion of time-delay nodes. (c) Data-flow graph after
rescheduling.


In Fig. 6(a), a maximal distance spanning tree for the
data-flow graph of the Fig. 4 filter example is shown by
thick-lined branches. From the maximal distance spanning tree we
calculate the earliest time at which the operation at (super)
node k can be started. The results are shown in Fig. 6(a)
assuming a new input data sample is available at time T = 0.
The algorithm next finds the shimming delays, which are
depicted in Fig. 6(a) by rectangular boxes. Notice that the
critical loop indeed contains no shimming delay.

Since all operations are executed periodically, and since
an instruction writes data to the same register as that of
the previous sample period, all shimming delays should have
values less than the sampling period $T_0$. If a shimming delay’s
value equals or exceeds $T_0$, a newly produced data sample
would be written over old data before the old data has been used
for further computations. This problem can be prevented by
employing additional move instructions which copy the old
data (moving it to another register) before the new data is
produced. Alternatively, we can sometimes avoid this problem
by rescheduling the operations, or by unfolding the data-flow
graph [14]. In the second-order filter example the problem
does not arise since all shimming delays have values which
are less than $T_0$.

Since all operations are executed periodically with the
sampling period $T_0$, we next modify the time schedule by
specifying its values modulo $T_0$: if the time $T(k)$, which is the
sum of the processing times from the input node to the node
k according to the maximal distance spanning tree, equals

$$T(k) = m\,T_0 + t(k), \quad m \text{ an integer}, \quad 0 \le t(k) < T_0$$

then operation k will be scheduled to start at step $t(k)$ in the
program. Notice that the time-delay nodes in the data-flow
graph have no effect on the value of $t(k)$; consequently they
are now removed (“short-circuited”). The time schedule for
the second-order filter example is shown in Fig. 6(b).
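
The scheduling step described in this subsection can be sketched as a longest-path computation in which every time-delay node has processing time $-T_0$. The Python below is a simplified illustration, not the authors' program: it uses Bellman-Ford style relaxation rather than an explicit spanning tree, it treats each multiplier and adder as a separate one-step node (so the supernode constraint is ignored and the numbers need not match Fig. 6), and `PREDECESSORS` is a hypothetical encoding of the Appendix C netlist.

```python
# Hypothetical encoding of the Fig. 4 netlist: node -> nodes feeding it.
PREDECESSORS = {
    "M1": ["T1"], "M2": ["T2"], "M3": ["A1"], "M4": ["T1"], "M5": ["T2"],
    "A1": ["M1", "A2"], "A2": ["M2", "X"], "A3": ["M3", "A4"],
    "A4": ["M4", "M5"], "T1": ["A1"], "T2": ["T1"], "Y": ["A3"], "X": [],
}

def processing_time(node, t0, t_mult=1, t_add=1):
    """Time-delay nodes count as -t0; input and output take no time."""
    if node.startswith("T"):
        return -t0
    return {"M": t_mult, "A": t_add}.get(node[0], 0)

def edges_of(preds):
    return [(src, dst) for dst, sources in preds.items() for src in sources]

def earliest_start_times(preds, t0, source="X"):
    """Longest path from the input to every node (Bellman-Ford relaxation)."""
    start = {node: float("-inf") for node in preds}
    start[source] = 0
    for _ in range(len(preds)):
        changed = False
        for src, dst in edges_of(preds):
            candidate = start[src] + processing_time(src, t0)
            if candidate > start[dst]:
                start[dst] = candidate
                changed = True
        if not changed:
            return start
    # Still improving after |V| passes: some loop has positive total time.
    raise ValueError("t0 is smaller than the minimum sampling period")

def modulo_schedule(start, t0):
    """Program step t(k) = T(k) mod t0 for every node k."""
    return {node: time % t0 for node, time in start.items()}

def shimming_delays(preds, start, t0):
    """Slack on each branch: start of consumer minus finish of producer."""
    return {(src, dst): start[dst] - start[src] - processing_time(src, t0)
            for src, dst in edges_of(preds)}

start = earliest_start_times(PREDECESSORS, t0=2)
steps = modulo_schedule(start, t0=2)
slack = shimming_delays(PREDECESSORS, start, t0=2)
```

In this simplified model the branches of the critical loop (A1, T1, M1) carry no slack, while the branch from M2 to A2 on the noncritical loop has positive slack. Calling `earliest_start_times` with `t0=1` raises an error, because the critical loop then has positive total processing time.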

D.  Operation  distribution 

According to the Fig. 3 flow graph, we must next determine
how the operations will be distributed over the processors,
and then check for the accessibility of data. If we find that
it is not possible to appropriately distribute the operations
over the processors, the operations will be rescheduled; that
is, one of the operations will be selected to start at a different
time (a different step in the program). In the data-flow graph
a rescheduling can be visualized as a pushing of shimming-delay
elements through the nodes. The algorithm also adds
shimming delays at the filter’s input and output nodes, which
are used in the rescheduling process. Except for a pipeline
delay, these additional shimming delays do not change the
filter operation; the new filter performs the same sequence of
multiplications and additions, and thus has the same behavior
with respect to quantization errors, as the filter without the
additional shimming delays. Our algorithm checks which
operations can be rescheduled and it reschedules one of these
operations in searching for a solution. It also keeps track of
how much each operation is shifted from the original time
schedule, to prevent duplication of rescheduling operations.
The rescheduling is repeated every time the program fails to
(re)distribute operations over the processors, until all possible
time schedules have been checked.

In the Fig. 6(b) data-flow graph of the second-order filter
example we see that there are four MAC instructions which
use the result of MAC1 at time t = 0; that is, immediately
after it has been produced. At that time the MAC1 result
will be available at the ALUR of the processor where it has
been produced. Since the ALUR of each processor in the ring
structure is only accessible by its own processor and by its
two neighbors, we can execute only three instructions at time
t = 0 which use the MAC1 result. Therefore we have to
reschedule one of the operations, MAC5 for example, so that
it starts at t = 1. The rescheduling of MAC5 does not change
the sequence in which the operations are executed since there
was a shimming delay of one step between MAC5 and adder
A4. After this rescheduling the second-order direct form II
filter example can be executed on three processors, which is
the minimum number of processors needed to implement this
filter at the optimum sampling period according to (4). The
data-flow graph after rescheduling is shown in Fig. 6(c).

The rescheduling described in this section may seem equivalent
to the retiming technique used in [13], which redistributes
time-delay elements over a filter structure and so creates new
time schedules. Our algorithm, however, does not redistribute
time-delay nodes (which have been deleted from the data-flow
graph after a maximal distance spanning tree was found),
but it redistributes the shimming-delay elements over the flow
graph. And while the retiming technique can improve the
sampling period of an implementation of a digital filter but
cannot guarantee a schedule to be rate-optimal [14], all our
implementations operate at the optimum sampling period $T_0$.

The initial distribution of operations over the parallel processors
assigns each operation to the processor with lowest
index that is free during the complete time it takes to execute
this operation. Notice that the operation of checking for a free
processor is performed modulo the sampling period $T_0$, since
all operations are executed periodically. Each redistribution
assigns an operation to the next available processor. In this
way all possible distributions of the operations over the parallel
processors can be tested.
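
The initial distribution rule can be sketched as a first-fit assignment with occupancy tracked modulo $T_0$. The code below is an illustration under assumptions: the function name and the `(name, start_step, busy_steps)` tuple format are invented here, and the example schedule (three one-step operations at step 0, two at step 1) is illustrative rather than a transcription of Fig. 6(c). The redistribution and rescheduling that the algorithm performs after a failure are not shown.

```python
def initial_distribution(operations, t0, processors):
    """Assign each operation to the lowest-index processor that is free.

    operations: iterable of (name, start_step, busy_steps) tuples.
    Returns {name: processor index}, or None if some operation finds no
    free processor (the caller would then redistribute or reschedule).
    """
    occupied = [set() for _ in range(processors)]  # busy steps modulo t0
    assignment = {}
    for name, start, busy_steps in operations:
        # Steps this operation occupies, wrapped around the sampling period.
        needed = {(start + offset) % t0 for offset in range(busy_steps)}
        for index, busy in enumerate(occupied):
            if not needed & busy:
                busy |= needed
                assignment[name] = index
                break
        else:
            return None
    return assignment

# Illustrative schedule: five one-step operations in a two-step program.
schedule = [("MAC1", 0, 1), ("MAC3", 0, 1), ("MAC4", 0, 1),
            ("MAC5", 1, 1), ("MAC2", 1, 1)]
on_three = initial_distribution(schedule, t0=2, processors=3)
on_two = initial_distribution(schedule, t0=2, processors=2)
```

With three processors every operation finds a slot; with two processors the third operation at step 0 cannot be placed and the sketch reports failure by returning `None`.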

E.  Data  accessibility 

After the operations are distributed over the processors,
the computer algorithm checks whether all data can be made
accessible to all processors that need it. To this end, for each
data sample, a list of processors needing it is formed. If a
processor needs data that has just been produced by itself
or one of its neighbors, this processor can be removed from
the list, since the data can be accessed via an ALUR. For
all processors that remain on the list, the program checks
whether “in-between processors” can move the data from the
processor where it was produced to the one where it is needed.
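
The first part of this check, removing from the list every processor that can read the producer's ALUR directly, depends only on the ring topology. A minimal sketch follows; the function names are hypothetical, processors are numbered 0 to P-1 around the ring, and timing (whether the data has "just been produced") is ignored.

```python
def ring_distance(a, b, processors):
    """Number of hops between processors a and b on a ring."""
    direct = abs(a - b) % processors
    return min(direct, processors - direct)

def processors_needing_moves(producer, consumers, processors):
    """Consumers that cannot read the producer's ALUR directly.

    The producer itself and its two ring neighbors are dropped from the
    list; the rest would need move instructions on in-between processors.
    """
    return sorted(c for c in set(consumers)
                  if ring_distance(producer, c, processors) > 1)

# Five-processor ring: data produced on processor 0, needed on 0, 1, 2, 4.
remaining = processors_needing_moves(0, [0, 1, 2, 4], processors=5)
direct_readers = [c for c in range(5) if ring_distance(0, c, 5) <= 1]
```

On the five-processor ring only processors 4, 0 and 1 can read data produced on processor 0 directly, which is the reason at most three instructions can use a freshly produced result in the same step.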

F.  Output 

The “output” of the computer algorithm is a set of programs
for the parallel processors which causes them to implement
the given filter structure.

It is easy to show how a parallel processor DSP chip
containing as few as three processors can implement the
general second-order direct-form II filter of Fig. 4(a). The
parallel (two-step) programs that our algorithm finds are given
in Fig. 7(a). The filter that the Fig. 7(a) programs implement
is shown in Fig. 7(b). This Fig. 7(b) filter structure can easily
be derived from the Fig. 6(c) data-flow graph by inserting
time-delay nodes in every branch where t = 0 (including
branches within supernodes) except for the branches which
leave from the input x, or those going to the output y. Notice
that the Fig. 7(b) structure actually implements the transfer
function $z^{-1}H(z)$, which differs from the specified $H(z)$
by one pipeline delay. In general, the algorithm is capable
of taking advantage of additional pipeline delays to achieve
efficient implementations of the specified filter, so being aware
of, and planning for, such modifications need not be a concern
to the user.
Fig. 7. Implementation of second-order direct form II filter. (a) Programs.
(b) Implemented filter structure.

An  examination  of  the  programs  will  demonstrate  how 
data  flows,  from  the  input  x  to  the  output  register  y,  as  the 
processors  communicate  with  each  other  by  means  of  their 
adjacent  shared  register  blocks  and  ALUR  registers. 

We use the arrows shown in the programs to indicate which
of the two adjacent blocks the output data is directed to.
Thus, we indicate a clockwise-directed output by a right-pointing
arrow, and a counterclockwise-directed output by a
left-pointing arrow. The two-way directed arrow in step 2 at
processor P2 in Fig. 7(a) shows that data A1 will be stored in
the register blocks at both sides of processor P2.

To speed up the task partitioner’s distribution of operations
over the processors and simultaneously to reduce the communication
between processors, our algorithm has the option to
add the additional constraint that an operation may only be
assigned to the processor where its input data are created, or
to one of the adjacent processors.

While we have been able to implement all practical examples
of digital filter structures at the optimum sampling period
$T_0$ on the ring of processors, it is possible to create contrived
data-flow graphs for which an implementation on P processors
operating at a sampling period $T_0$ does not exist [15]. If this
occurs the sampling period $T_0$ will be increased by one time
unit, and the algorithm will then continue with the initial time
scheduling, as shown in Fig. 3.

IV.  Comparison  With  Other  Scheduling  Algorithms 

The computer algorithm presented in this paper has some
similarities with the range-chart-guided iterative data-flow
graph scheduling of [15]-[18]. In [15] the following scheduling
methods have been compared with each other:

1) Single iteration methods [20], [21];

2) Direct blocking methods [4], [18];

3) Fixed rate methods based on:

a. Maximal distance spanning tree [8], [9];

b. Optimum unfolding [14], [22], [23];

c. Cyclo-static scheduling [3]-[5].


In this section, we shall therefore compare our algorithm with
the algorithm of [15].

All algorithms start with a description of the desired digital
filter by a data-flow graph. Our algorithm is the only one
that first combines each multiplier with a subsequent adder
in a two-step MAC instruction; this, of course, is dictated by
the advantage of implementing a MAC operation as a single
instruction on our hardware, which reduces the total number
of instructions significantly. The number of instructions in an
FIR filter program, for example, is reduced by 50 percent [1].

Similar to most scheduling algorithms, we first calculate the
minimum sampling period, assuming that an unlimited number
of processors is available. Our algorithm is the only one
that has the option to automatically change the topology of
multiple-input adders and so improve the minimum sampling
period. Like the range-chart-guided scheduling, but unlike all
other scheduling algorithms, we also calculate the
optimum sampling period assuming that a limited number of
processors is available. We then calculate the amount of time
over which operations can be rescheduled. We use the maximal
distance spanning tree to find the earliest time at which each
operation can be started, so our “reference node” [15] is the
input node. Since all operations are executed periodically we
next delete the time-delay nodes from the data-flow graph,
taking care that shimming delays do not exceed or equal the
sampling period $T_0$. Therefore our scheduling range is limited
to $0 \le t < T_0$. The range-chart-guided scheduling does not
necessarily place a $T_0$ limit on a program’s schedule. In fact,
it appears that the implementation of the second-order direct
form II filter shown in Fig. 13(b) of [15] has a new c8 data
value calculated before the old c8 value has been used in the
addition that forms c6. Therefore, if the programs were to write
data to the same register each sampling period, the processor
P4 at time t = 0 would perform the incorrect addition
c6(t) = c7(t) + c8(t + 1).

The time schedule found with the range-chart-guided scheduling
algorithm does not necessarily produce the minimum
number of levels. An example in which the algorithm produces
more than the minimum required number of levels is shown
in Fig. 8. Fig. 8(a) shows a circuit with four additions (one
step), two multiplications (two steps), and three time delays.
The minimum sampling period is $T_{\min}$ = 4 steps, which is
dictated by the loop c1, c2, c3, c4, d1. If we select c1 as
our reference node the scheduling-range chart will become
as shown in Fig. 8(b). Following the scheduling algorithm
presented in [15] the sequence in which the operations are to
be scheduled is: 1, 2, 3, 4, 5, 6. The final equivalence class is
shown in Fig. 8(c). When operation 5 is placed at the upper
fixed limit, we need a third level to place operation 6. The
algorithm therefore does not find the optimum solution, which
has only two levels and can be implemented on two processors,
as shown in Fig. 8(d).

Fig. 8. Example with no optimum range-chart-guided scheduling. (a) Filter
structure. (b) Scheduling-range chart. (c) Scheduling according to [15]. (d)
Optimum scheduling.


Even if the number of levels found in the range-chart-guided
scheduling algorithm is minimal, this program does
not guarantee an implementation on the minimum number
of processors, as shown in the following example. Consider
the assignment of operations in Fig. 9(a). In the range-chart-guided
scheduling algorithm the operations will be assigned
to processors in the sequence: 5 to P1, 4 to P1, 3 to P2,
7 to P2, 2 to P2. At this moment, operation 6 must be
assigned to a processor. However, since processor P1 is used
by operation 4 at time t = 7, and since processor P2 is used
by operation 3 at time t = 6, we have to use a third processor
to execute operation 6, as shown in Fig. 9(b). This distribution
of operations over the processors is not optimal, since we can
assign the original schedule immediately to two processors.


Fig. 9. Example with no optimum range-chart-guided distribution. (a) Schedule
of operations to be distributed. (b) Processor distribution according to
[15].


One  of  the  differences  between  our  algorithm  and  the  range- 
chart-guided  scheduling  algorithm  is  that  the  latter  does  not 
try  to  reschedule  operations  if  the  number  of  levels  exceeds 
the  number  of  processors,  nor  does  it  try  to  redistribute  these 
operations  if  the  assignment  of  operations  to  the  processors  is 
unsuccessful. 

Finally,  none  of  the  other  scheduling  algorithms  seems  to 
have  employed  a  scheduling  of  operations  to  processors  where 
data  communication  is  restricted  due  to  processor  topology 
constraints. 

V.  Conclusions 

In  this  paper,  a  computer  algorithm  has  been  described  that 
automatically  writes  optimal  programs  for  parallel  processors. 
The  algorithm  has  been  adapted  particularly  for  programming 
a  DSP  chip  with  multiple  processors  arranged  in  a  ring-type 
topology,  but  it  can  easily  be  modified  for  other  multiprocessor 
digital  filter  chips.  The  algorithm  can  check  all  possible  timing 
schedules  and  all  possible  distributions  of  the  operations  over 
the  parallel  processors,  taking  into  account  the  constraints 
imposed  by  the  multiprocessor  topology  and  the  processors’ 
architecture.  It  searches  iteratively  for  a  set  of  programs  that 
implements  the  given  digital  filter  at  the  optimum  sampling 
period  on  a  limited  number  of  processors. 

We have proved with this computer algorithm that, among
others, the following common filter structures can be implemented
to execute in the optimal manner on a single chip
which contains five processors in a ring-type topology [1]:
cascades of second-order direct-form II filters (one program
step per second-order section, for cascades of two or more
filter sections), arbitrary FIR filters (one program step per five
filter taps), a tenth-order Gray-Markel lattice filter (five-step
program), a general second-order state-space filter (three-step
program), and a fifth-order wave digital filter (nine-step
program).

Since FIR filters and cascades of second-order direct-form
II filters are the most common digital filter structures, we have
created a library for the programs that implement these types
of filters with arbitrary order. At the start of our algorithm, the
user has the option to directly call programs from the library,
and after completion of the scheduling of the operations over
the processors the user can add newly-found programs to this
library.

While the algorithm exhibits a worst-case running time
which increases rapidly with the number of instructions in the
data-flow graph, for all practical examples the running time
was quite acceptable. Table I in Appendix A compares the
number of computations for the calculation of the minimum
sampling period of a “worst-case” example filter using the
Renfors-Neuvo method (used in our algorithm) and using a
polynomial-time algorithm. Notice that it is only when the
order of a filter section equals or exceeds seven that the
polynomial-time algorithm outperforms our algorithm. This
fact, along with the fact that worst-case examples are hardly
ever encountered, accounts for the quite acceptable performance
of our algorithm, even though, in principle, one could
expect to sometimes encounter unreasonably long computation
times.
TABLE I

Total Number of Computations for the Calculation of
the Minimum Sampling Period $T_{\min}$ of an N-th Order
Gray-Markel Lattice Digital Filter with the Renfors-Neuvo
Method ($M_1$) and with the Polynomial-Time Algorithm ($M_2$).

   N          M1          M2
   1           5           9
   2          15          72
   3          43         272
   4         136         710
   5         534       1,497
   6       2,629       2,756
   7      15,812       4,614
   8     112,504       7,205
   9     921,598      10,670
  10   8,525,229      15,151

VI.  Appendix  A 

In [10] it has been shown that in a data-flow graph in which
there is a directed path between every pair of time-delay nodes
(e.g., a Gray-Markel lattice digital filter [19]) the total number
of loops equals

$$L = \sum_{k=1}^{N} \frac{N!}{k\,(N-k)!} \qquad (5)$$

where N is the total number of time-delay nodes in the data-flow
graph. An algorithm that finds the minimum sampling
period $T_{\min}$ of a Gray-Markel lattice filter using (1) must
perform a total of $M_1 = G \cdot L$ computations, where G is the
average number of arithmetical nodes in the loops of the data-flow
graph. In the Gray-Markel lattice filter $G \approx (N + 13)/3$.

As remarked in [10] the minimum sampling period $T_{\min}$
can be found by adapting the algorithm for the minimal
cost-to-time ratio cycle problem presented in [12]. The total
number of computations required by this program is
$M_2 = N^3 \cdot \log_2(2 N^3 F) + N \cdot E$, where E is the total number of
edges in the data-flow graph, and F is the maximum number
of arithmetical nodes between a pair of time-delay nodes. In
the Gray-Markel lattice filter $E = 6 \cdot N$ and $F = N + 2$.

Table I shows the values of $M_1$ and $M_2$ for the Gray-Markel
lattice digital filter. Notice that it is only when $N \ge 7$ that
$M_2 < M_1$. This fact, along with the fact that most digital
filter structures (unlike the Gray-Markel lattice) do not possess
directed paths between all pairs of time-delay nodes, and hence
typically possess a far smaller total number of loops than
indicated by (5), accounts for the quite acceptable performance
of our algorithm when employed for “real problems,” even
though, in principle, one could expect to sometimes encounter
unreasonably long computation times.
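
The two operation counts can be evaluated numerically to see where the crossover in Table I comes from. The sketch below uses the expressions of this appendix with the Gray-Markel values $G \approx (N + 13)/3$, $E = 6N$ and $F = N + 2$; the function names are chosen for this example, and because G is only approximate the results agree with the table entries only up to rounding.

```python
from math import factorial, log2

def loop_count(n):
    """Eq. (5): number of loops when all delay pairs are connected."""
    return sum(factorial(n) // (k * factorial(n - k)) for k in range(1, n + 1))

def renfors_neuvo_cost(n):
    """M1 = G * L with G ~ (N + 13) / 3 for the Gray-Markel lattice."""
    return loop_count(n) * (n + 13) / 3

def polynomial_cost(n):
    """M2 = N^3 * log2(2 N^3 F) + N * E with E = 6 N and F = N + 2."""
    edges = 6 * n
    max_nodes_between_delays = n + 2
    return n ** 3 * log2(2 * n ** 3 * max_nodes_between_delays) + n * edges

def crossover_order(limit=20):
    """Smallest order at which the polynomial-time count is the lower one."""
    for n in range(1, limit + 1):
        if polynomial_cost(n) < renfors_neuvo_cost(n):
            return n
    return None

crossover = crossover_order()
```

The loop count grows from 24 loops at N = 4 to 16,072 loops at N = 8, and the polynomial-time count first drops below the loop-enumeration count at N = 7.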

VII.  Appendix  B 

The function $S$ is defined recursively, for all positive
integers, by

$$S(1) = 1, \qquad S(N) = \frac{1}{2} \sum_{M=1}^{N-1} \binom{N}{M} S(M)\,S(N-M) \quad \text{for } N \ge 2. \qquad (6)$$

We shall prove that $S(N) = (2N-3)!!$, for $N \ge 2$,
where we define $i!! = i(i-2)(i-4) \cdots 5 \cdot 3 \cdot 1$ for $i$ odd.
The proof is by induction. That is, for all $J = 1, \ldots, N-1$,
we assume that $S(J) = (2J-3)!!$ and we shall show that
$S(N) = (2N-3) \cdot S(N-1)$.

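Proof: Using the property of binomial coefficients

$$\binom{N}{M} = \binom{N-1}{M-1} + \binom{N-1}{M}$$

we find from (6)

$$S(N) = \frac{1}{2} \sum_{M=1}^{N-1} \left[ \binom{N-1}{M-1} + \binom{N-1}{M} \right] S(M)\,S(N-M).$$

Define $M' = N - M$ in the first part of the summation, then

$$S(N) = \frac{1}{2} \sum_{M'=1}^{N-1} \binom{N-1}{N-M'-1} S(N-M')\,S(M') + \frac{1}{2} \sum_{M=1}^{N-1} \binom{N-1}{M} S(M)\,S(N-M).$$

Using the property of binomial coefficients

$$\binom{N}{M} = \binom{N}{N-M} \qquad (7)$$

we find

$$S(N) = \sum_{M=1}^{N-1} \binom{N-1}{M} S(M)\,S(N-M). \qquad (8)$$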

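Let

$$S(N-M) = (2N - 2M - 3) \cdot S(N-M-1)$$

for $1 \le M \le N-2$.

Then (8) becomes

$$S(N) = \binom{N-1}{N-1} S(N-1)\,S(1) + \sum_{M=1}^{N-2} \binom{N-1}{M} (2N-2M-3)\,S(M)\,S(N-1-M).$$

Define $M' = N - 1 - M$ and split the summation into two
identical parts, then

$$S(N) = S(N-1) + \frac{1}{2} \left( \sum_{M=1}^{N-2} \binom{N-1}{M} (2N-2M-3)\,S(M)\,S(N-1-M) + \sum_{M'=1}^{N-2} \binom{N-1}{N-1-M'} (2M'-1)\,S(N-1-M')\,S(M') \right),$$

where the leading term $S(N-1)$ follows from $S(1) = 1$.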
TABLE II

Name      Input 1    Input 2    Comments
A-name    Input 1    Input 2    ; Adder
M-name    Input      Coef.      ; Multiplier
T-name    Input                 ; Time delay
Y         Output                ; Output node, and last line

TABLE III

Name    Input 1    Input 2    Comments
M1      T1         b1
M2      T2         b2
M3      A1         a0
M4      T1         a1
M5      T2         a2
A1      M1         A2
A2      M2         X          ; X = input node
A3      M3         A4
A4      M4         M5
T1      A1
T2      T1
Y       A3

Using property (7) and taking the corresponding terms together
we have

$$S(N) = S(N-1) + (2N-4) \cdot \frac{1}{2} \sum_{M=1}^{N-2} \binom{N-1}{M} S(M)\,S(N-1-M).$$

We recognize the summation, including its factor of 1/2, as $S(N-1)$ by (6), so
$$S(N) = (2N-3) \cdot S(N-1)$$
which completes the proof.
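The identity can also be checked numerically. The sketch below assumes that (6) reads $S(1) = 1$ and $S(N) = \frac{1}{2}\sum_{M=1}^{N-1}\binom{N}{M} S(M)S(N-M)$, and compares the recursion with the odd double factorial; the helper names are illustrative only.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def S(n):
    """Recursion (6), evaluated in exact integer arithmetic."""
    if n == 1:
        return 1
    total = sum(comb(n, m) * S(m) * S(n - m) for m in range(1, n))
    # Terms m and n - m are equal, and a middle term carries the even
    # factor C(n, n/2), so the sum is always even.
    return total // 2

def odd_double_factorial(i):
    """i!! = i (i - 2) (i - 4) ... 3 * 1 for odd i; returns 1 for i <= 1."""
    result = 1
    while i > 1:
        result *= i
        i -= 2
    return result

if __name__ == "__main__":
    for n in range(2, 9):
        print(n, S(n), odd_double_factorial(2 * n - 3))
```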

VIII.  APPENDIX  C 

The  program’s  input-file  entries  are  of  the  form  shown  in 
Table  II. 

The  input  file  for  the  second-order  direct  form  II  filter 
example  is  shown  in  Table  III. 
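To make the node-list format concrete, the sketch below parses the Table III entries and evaluates the resulting data-flow graph sample by sample. It is not the authors' program: the parsing rules (first letter A, M, T or Y selects the node type, text after ";" is a comment), the function names and the coefficient values used in the checks are all assumptions made for illustration.

```python
NETLIST = """
M1 T1 b1
M2 T2 b2
M3 A1 a0
M4 T1 a1
M5 T2 a2
A1 M1 A2
A2 M2 X   ; X = input node
A3 M3 A4
A4 M4 M5
T1 A1
T2 T1
Y  A3
"""

def parse(text):
    """Map each node name to its list of inputs, ignoring ';' comments."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split(";", 1)[0].split()
        if fields:
            nodes[fields[0]] = fields[1:]
    return nodes

def simulate(nodes, coefs, x_samples):
    """Evaluate the graph for each input sample; delays start at zero."""
    state = {name: 0.0 for name in nodes if name.startswith("T")}
    outputs = []
    for x in x_samples:
        cache = {"X": x, **state}          # delays expose their stored value

        def value(name):
            if name not in cache:
                ins = nodes[name]
                if name.startswith("A"):   # adder
                    cache[name] = value(ins[0]) + value(ins[1])
                elif name.startswith("M"): # multiplier: input times coefficient
                    cache[name] = value(ins[0]) * coefs[ins[1]]
                else:                      # output node Y
                    cache[name] = value(ins[0])
            return cache[name]

        outputs.append(value("Y"))
        state = {t: value(nodes[t][0]) for t in state}  # clock the delays
    return outputs

if __name__ == "__main__":
    coefs = {"a0": 1.0, "a1": 0.0, "a2": 0.0, "b1": 0.5, "b2": 0.0}
    print(simulate(parse(NETLIST), coefs, [1.0, 0.0, 0.0, 0.0]))
```

With these entries the graph computes $w(n) = x(n) + b_1 w(n-1) + b_2 w(n-2)$ and $y(n) = a_0 w(n) + a_1 w(n-1) + a_2 w(n-2)$, which is the second-order direct form II structure.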
Abstract

A technique using Jacobian elliptic functions is
given which, by removing a previous method’s double-zero
constraint, yields improved designs of linear phase
IIR filters.


Introduction 

In theory it is possible to implement linear phase
IIR filters as a tandem connection of an arbitrary
transfer function $H(z)$ and a time-reversed version of
the same function $H(z^{-1})$. Powell and Chau [1] have devised
a clever technique for doing this by approximating
a local (in time) time-reversal operation using two
copies of a desired IIR filter, several blocks of storage
registers, and control circuitry that accesses two
of these register blocks on a “last-in, first-out” (LIFO)
basis. There is, however, a certain inefficiency in the
use of identical transfer functions in the $H(z^{-1})H(z)$
cascade because the stopband transmission zeros all
appear as “double zeros” in any such system. We
have devised a design technique, using Jacobian elliptic
functions, for an optimal pair of transfer functions
which, when cascaded as $H_1(z^{-1})H_2(z)$ in the
same type of IIR time-reversal system, will meet the
same linear-phase design specifications as that of [1],
but which also yields an additional 6 dB of stopband
loss. This extra loss can be traded for lower passband
ripple and/or a narrower transition band by a simple
revision of the design specifications.

The technique of [1] of course yields only an approximate
linear-phase design because certain errors are inevitable
in the implementation of the time-reversal process, due to
the finite length of the register blocks and the infinite
length of the filter’s impulse response. For large enough
register blocks the approximation can be quite acceptable,
but this required length grows as the filter specifications
become more demanding. When the register-block length is
not quite sufficient, errors in the passband magnitude
and phase response occur. Our investigations
indicate that these effects are less pronounced in our
$H_1(z^{-1})H_2(z)$ design than they are in the design of
[1]; thus, this represents another advantage for our
modified approach.

To illustrate our design technique consider the Fig.
1(a) pole-zero plot of a lowpass filter $H(z)$. The corresponding
pole-zero pattern of $H(z^{-1})$ is shown in Fig.
1(b), and the overall filter designed as in [1] would have
the pole-zero pattern shown in Fig. 1(c). In principle,
a transfer function having the same set of Fig. 1(c)
poles, but whose zeros (while confined to the unit circle)
were not constrained to have order two, would be
better able to distribute the zeros throughout the filter’s
stopband, thereby yielding a more efficient transfer
function. Such a filter, whose pole-zero pattern
could take the form of Fig. 1(d), would still possess a
linear phase frequency response.

To improve the system given in [1] we must find
a way to modify the elliptic filter design method to
meet given passband and stopband specifications, such
that it uses $2n$ simple zeros, distributed throughout
the stopband, and uses $n$ (i.e., half as many) double
poles (thus, a total of $2n$ poles) inside the unit circle.
Then, this collection of poles and zeros can be
allocated to two transfer functions $H_1(z)$, $H_2(z)$ both
having identical sets of (simple) poles. If we then build
a filter with a transfer function $H(z) = H_1(z^{-1})H_2(z)$
it will have the pole-zero pattern of Fig. 1(d). Its
implementation could use the same structure given in
[1]; however, while an $H(z)H(z^{-1})$ system has linear
phase for any $H(z)$, its Fig. 2 generalization has linear
phase for all $H_1(z)$, $H_2(z)$ where (1) $H_1$ and $H_2$ have
identical poles, and (2) the zeros of $H_1$ and $H_2$ all lie
on the unit circle.

We have, in fact, found a method to design optimal
transfer functions $H_1(z)$, $H_2(z)$ yielding a Fig.
2 filter with an equal-ripple passband and stopband.
The functions $H(z) = H_1(z^{-1})H_2(z)$ are first constructed
as their equivalents $H(s) = H_1(-s)H_2(s)$ in
the conventional analog variable $s = \Sigma + j\Omega$, where
$z = (1 + sT)/(1 - sT)$. Reversing the sign of $s$ corresponds
to taking the reciprocal of $z$. The equal-ripple
passband is normalized to $0 \le \Omega \le 1$, and the lower
edge of the equal-ripple stopband is then at $\Omega_s$.

We require $H_1(s)$ and $H_2(s)$ to have identical left-half
s-plane poles, so the poles of $H_1(-s)$ must lie in
the right-half s-plane at the negatives of the poles of
$H_2(s)$. The poles of $H(s)$ are thus the zeros of an even
polynomial. The zeros of $H(s)$ must be simple and lie
on the $j\Omega$ axis in the stopband $\Omega_s \le \Omega \le \infty$. It follows
that $H(j\Omega)$ is purely real and of even degree.

If $H_1(s)$ and $H_2(s)$ had identical zeros as well as
identical poles, then we should have $H_1(s) = H_2(s)$,
and $H(s)$ would reduce to the conventional even function
$H_1(-s)H_1(s)$ which is used in constructing a complex
filter function $H_1(s)$ to have a prescribed loss
while ignoring the phase. We exploit this observation
by using very much the same tools for constructing
$H(s)$ as one would for $H_1(-s)H_1(s)$ except for making
it have simple rather than double $j\Omega$-axis zeros.


A  Modified  Elliptic  Filter  Design 

We consider first the simplest case where all the
zeros of $H(s)$ are at infinity, and where we do have
$H_1(s) = H_2(s)$. Then $H_1(s)$ is identical to the standard
Chebyshev lowpass filter. This $H(s)$, of degree
$n = 2m$, can be constructed parametrically in the classic
way with circular functions as:

$$\frac{1}{H(s)} = 1 + \epsilon^2 \cos^2 m\theta \quad \text{and} \quad \Omega = \cos\theta \qquad (1)$$

The passband ripple of $H(s)$ (not of $H_1(s)$) is
$a_p = 20\log_{10}(1 + \epsilon^2)$ dB.

However, if we try to generalize (1) as it stands to
the elliptic-function case, we would still retain $H_1(s) = H_2(s)$
and $H(s)$ would have double $j\Omega$-axis zeros. To
circumvent this we rearrange (1), replacing $\cos^2 m\theta$ by
$(\cos 2m\theta + 1)/2$ and $2m$ by $n$ to get

$$\frac{1}{H(s)} = 1 + \epsilon^2 (\cos n\theta + 1)/2 = \frac{1}{1-t}\,(1 + t\cos n\theta)$$

where $1 + \epsilon^2 = (1+t)/(1-t)$. For digital filters we can
discard the factor $(1-t)^{-1}$ yielding

$$\frac{1}{H(s)} = 1 + t\cos n\theta \quad \text{and} \quad \Omega = \cos\theta \qquad (2)$$

with passband ripple $a_p = 20\log_{10}[(1+t)/(1-t)]$ dB.

It is easily confirmed that the poles of $H(s)$ lie on
an ellipse in the s-plane at the points

$$s_\sigma = \gamma \sin\frac{(2\sigma-1)\pi}{n} + j\delta \cos\frac{(2\sigma-1)\pi}{n} \qquad (\sigma = 1, 2, \ldots, n) \qquad (3)$$

where

$$\gamma = \sinh\theta_2, \quad \delta = \cosh\theta_2, \quad \exp n\theta_2 = t^{-1} + (t^{-2} - 1)^{1/2}.$$

The poles of $H_1(s) = H_2(s)$ are the left-half s-plane
poles of $H(s)$; their zeros are all at infinity.
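As a numerical check, the points given by (3) can be substituted back into (2). The sketch below assumes the reading of (3) as $s_\sigma = \gamma\sin[(2\sigma-1)\pi/n] + j\delta\cos[(2\sigma-1)\pi/n]$ with $\exp n\theta_2 = t^{-1} + (t^{-2}-1)^{1/2}$; the function names and the values $n = 6$, $t = 0.01$ are arbitrary illustrative choices.

```python
import cmath
import math

def circular_case_poles(n, t):
    """Poles of H(s) from (3) for 1/H(s) = 1 + t cos(n theta), Omega = cos(theta)."""
    theta2 = math.log(1 / t + math.sqrt(1 / t**2 - 1)) / n
    gamma, delta = math.sinh(theta2), math.cosh(theta2)
    poles = []
    for sigma in range(1, n + 1):
        angle = (2 * sigma - 1) * math.pi / n
        poles.append(complex(gamma * math.sin(angle), delta * math.cos(angle)))
    return poles

def inverse_H(s, n, t):
    """Evaluate 1/H(s) from (2), using s = j Omega and Omega = cos(theta)."""
    omega = s / 1j
    # cos(n * acos(w)) is the degree-n Chebyshev polynomial, so the branch
    # chosen by cmath.acos does not affect the result for integer n.
    return 1 + t * cmath.cos(n * cmath.acos(omega))

if __name__ == "__main__":
    n, t = 6, 0.01
    for p in circular_case_poles(n, t):
        print(f"{p:.6f}  |1/H| = {abs(inverse_H(p, n, t)):.2e}")
```

Half of the $n$ points lie in the left half-plane; those are the ones assigned to $H_1(s) = H_2(s)$ in this simplest case.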

To distribute simple zeros over the stopband, instead
of having them all at infinity, we merely replace
the cosine function in (2) by its appropriate Jacobian
elliptic function equivalent, namely the $\mathrm{cd}(u; k)$ function
(not the $\mathrm{cn}(u; k)$ function). Eq. (2) then becomes:

$$\frac{1}{H(s)} = 1 + t\,\mathrm{cd}(nuK_1/K\,;\,k_1) \quad \text{and} \quad \Omega = \mathrm{cd}(u\,;\,k) \qquad (4)$$

where $k = \Omega_s^{-1}$. The Jacobian functions are doubly
periodic. When, as in the present application, the
modulus $k$ is real and $0 < k < 1$, one period is real
with quarterperiod $K$ and one is imaginary with quarterperiod
$K'$. The quarterperiods for elliptic functions
are the counterparts of $\pi/2$ for circular and hyperbolic
functions. Varying $k$ changes both $K$ and the ratio
$K'/K$.

In order to make the parametric equations in (4)
describe a rational function with the required characteristics,
it is necessary that the modulus $k_1$ (whose
associated quarterperiods are $K_1$ and $K_1'$) and the
scale factor $nK_1/K$ on $u$ be chosen so that, from
the viewpoint of the variable $u$, the quarterperiod
rectangle of $1/H(s)$ fits exactly $n$ times along the
real axis into the quarterperiod rectangle of $\Omega$,
but only once along the imaginary axis. That is, so
that when $u = K$, $nuK_1/K$ becomes $nK_1$, while
when $u = jK'$, $nuK_1/K$ becomes $jK_1'$. This requires
the modulus $k_1$ to be related to $k$ so that
$nK'/K = K_1'/K_1$. This can be expressed alternatively
by $q^n = q_1$ in terms of the parameter $q$ belonging
to the closely related Theta functions and defined
by $q = \exp(-\pi K'/K)$.


Over the stopband, $\Omega_s \le \Omega \le \infty$, $1/H(s)$ has simple
poles at which it changes sign. At $\Omega_s$ and at the
turning points between the poles, the elliptic function
has the value $\pm 1/k_1$. In an interval between two adjacent
poles where the function is positive, the minimum
loss is $20\log_{10}(1 + t/k_1)$ dB, whereas when it is negative
the minimum loss is $20\log_{10}(t/k_1 - 1)$ dB. In any
practical filter $t/k_1$ will be much larger than unity, so
the minimum loss can be approximated by:

$$a_s = 20\log_{10}\frac{t}{k_1} \ \text{dB} \qquad (5)$$

The exact minima of the loss will be alternately
slightly higher and slightly lower than (5), but the
departures from (5) are quite small. For example, at
40 dB loss they are less than 0.1 dB, and at 60 dB less
than 0.01 dB.

At the outset in a design, $a_p$, $a_s$ and $k$ are prescribed
and we need to find the lowest degree $n$ of
filter that meets this specification. As $n$ must be an
integer, even its lowest permissible value will usually
provide some margin in performance, and the parameters
can then be readjusted to distribute this margin
over $a_p$, $a_s$ and $k$. Eq. (5), combined with some way
of computing $k_1$ from $k$ and $n$, gives the relationship
between the four quantities concerned.

By expanding $k_1^2$ into a power series in $q_1 = \exp(-\pi K_1'/K_1)$,
and then replacing $q_1$ by $q^n$, we get
$k_1^2 = 16q^n - 128q^{2n} + 704q^{3n} - \ldots$. When $a_p \le 0.1$ dB
and $a_s \ge 20$ dB, $q_1 = q^n < 2.1 \times 10^{-8}$ and it is clear
that the first term in the series is a more than adequate
practical approximation to $k_1^2$. Substituting $4q^{n/2}$ for
$k_1$ in (5) leads to the design equation

$$a_s = n\,10\log_{10}\frac{1}{q} + 20\log_{10} t - 12.04 \ \text{dB} \qquad (6)$$

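The parameter $q$ depends only upon $k$ and is computed
as follows:

Let $k = \sin\phi$.

If $\phi \le 45°$: $\quad q = \dfrac{1}{2}\,\dfrac{1 - \sqrt{\cos\phi}}{1 + \sqrt{\cos\phi}}$

If $\phi > 45°$: $\quad q' = \dfrac{1}{2}\,\dfrac{1 - \sqrt{\sin\phi}}{1 + \sqrt{\sin\phi}}$ and $q = \exp\!\left(\dfrac{\pi^2}{\ln q'}\right)$

This gives $q$ accurate to at least 1 part in $10^5$.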
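The degree selection described above can be sketched as follows. The code assumes the one-term nome formulas as read here (with $q = \exp(\pi^2/\ln q')$ for $\phi > 45°$), takes $t$ from $a_p = 20\log_{10}[(1+t)/(1-t)]$, and searches over even $n$ only, which is an assumption based on $H$ being of even degree. The arithmetic-geometric-mean reference is included only to check the approximation; all names and the sample specification are illustrative.

```python
import math

def nome(k):
    """One-term approximation to q = exp(-pi K'/K) for modulus k, 0 < k < 1."""
    phi = math.asin(k)
    if phi <= math.pi / 4:
        c = math.sqrt(math.cos(phi))
        return 0.5 * (1 - c) / (1 + c)
    s = math.sqrt(math.sin(phi))
    q_comp = 0.5 * (1 - s) / (1 + s)            # nome of the complementary modulus
    return math.exp(math.pi**2 / math.log(q_comp))

def agm(a, b):
    """Arithmetic-geometric mean (converges quadratically)."""
    for _ in range(40):
        a, b = (a + b) / 2, math.sqrt(a * b)
    return a

def nome_reference(k):
    """Reference nome from K'/K = agm(1, k') / agm(1, k)."""
    kp = math.sqrt(1 - k * k)
    return math.exp(-math.pi * agm(1.0, kp) / agm(1.0, k))

def stopband_loss_db(n, k, ap_db):
    """Design equation (6): a_s = n 10 log10(1/q) + 20 log10(t) - 20 log10(4)."""
    r = 10 ** (ap_db / 20)
    t = (r - 1) / (r + 1)
    return n * 10 * math.log10(1 / nome(k)) + 20 * math.log10(t) - 20 * math.log10(4)

def min_even_degree(k, ap_db, as_db):
    """Smallest even n whose predicted stopband loss meets as_db."""
    n = 2
    while stopband_loss_db(n, k, ap_db) < as_db:
        n += 2
    return n

if __name__ == "__main__":
    k, ap_db, as_db = 0.5, 0.1, 60.0            # hypothetical specification
    n = min_even_degree(k, ap_db, as_db)
    print(n, round(stopband_loss_db(n, k, ap_db), 2))
```

Any loss predicted in excess of the requirement is the margin that the text suggests redistributing over $a_p$, $a_s$ and $k$.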

When $k$, $a_p$ and $n$ have been chosen, there remains
only the calculation of the complex poles and $j\Omega$-axis
zeros of $H(s)$ to complete the design. The simplest
and most accurate way of doing this is via a sequence
of Landen transformations from the corresponding circular
functions. The Landen transformation is an algebraic
relation between certain elliptic functions belonging
to two different modulus values with the ratio
$K'/K$ for one modulus twice that for the other. Iterating
the transformation soon produces a modulus so
small, and for which the ratio $K'/K$ is so large, that
the elliptic functions belonging to it have degenerated
into (are numerically indistinguishable from) circular
functions. Working backwards along this chain of descending
moduli one can then, step by step, transform
the circular functions into the desired elliptic functions.

Let us denote the initial modulus $k = \Omega_s^{-1}$ by $k_0$
and the moduli obtained by successive Landen transformations
by $k_1, k_2, \ldots$. The $k_i$ are related by

$$k_{i+1} = \left[ \frac{k_i}{1 + \sqrt{1 - k_i^2}} \right]^2.$$

If the arithmetic is carried out to $d$ decimal digits, then the
transformations are stopped when $k_r < 10^{-d}$.

Next, we find the reciprocals of the complex poles
$s_\sigma$ given in (3) for the circular-function case to which
the elliptic functions have been reduced after $r$ steps
of the Landen transformation. Let $a_r + jb_r = s_\sigma^{-1}$.
After $r$ transformations using

$$a_{i-1} + jb_{i-1} = \frac{1}{1 + k_i} \left[ a_i + jb_i - \frac{k_i}{a_i + jb_i} \right] \qquad (7)$$

we get $a_0 + jb_0$ which is the reciprocal of the corresponding
pole of $H(s)$. This need be done of course
only for the left-half s-plane poles of $H(s)$.

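Finally, the $j\Omega$-axis zeros of $H(s)$ are given by

$$\Omega_\sigma = \frac{\pm 1}{k\,\mathrm{cd}[(2\sigma-1)K/n\,;\,k]} \qquad (\sigma = 1, 2, \ldots, n/2)$$

We use the same chain of moduli $k_i$ as for the poles,
and (7) with $a_r = 0$ and $b_r = 1/\cos[(2\sigma-1)\pi/(2n)]$.
Starting with $a_r = 0$ causes all $a_i$ to vanish, and (7)
simplifies to

$$b_{i-1} = \frac{1}{1 + k_i} \left[ b_i + \frac{k_i}{b_i} \right] \qquad (8)$$

After $r$ steps using (8), $\Omega_\sigma = b_0/k$.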
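The descending chain of moduli and the backward recursion can be sketched as below. The code assumes the readings $k_{i+1} = [k_i/(1+\sqrt{1-k_i^2})]^2$ and (7) as given above, and applies (7) to a purely imaginary starting value so that it reduces to (8). Only the zeros are computed; the function names are illustrative. The check uses the standard identity $\mathrm{cd}(K/2; k) = 1/\sqrt{1+k'}$ with $k' = \sqrt{1-k^2}$.

```python
import math

def landen_chain(k, digits=16):
    """Descending moduli [k0, k1, ..., kr], stopping once kr < 10**(-digits)."""
    chain = [k]
    while chain[-1] >= 10.0 ** (-digits):
        ki = chain[-1]
        chain.append((ki / (1 + math.sqrt(1 - ki * ki))) ** 2)
    return chain

def ascend(c_r, chain):
    """Apply (7) for i = r, r-1, ..., 1 to the complex value c_r = a_r + j b_r."""
    c = c_r
    for ki in reversed(chain[1:]):
        c = (c - ki / c) / (1 + ki)
    return c

def stopband_zeros(n, k):
    """Positive j-Omega-axis zeros Omega_sigma = b_0 / k for sigma = 1..n/2."""
    chain = landen_chain(k)
    zeros = []
    for sigma in range(1, n // 2 + 1):
        b_r = 1 / math.cos((2 * sigma - 1) * math.pi / (2 * n))
        b_0 = ascend(complex(0.0, b_r), chain).imag   # a_i stays zero
        zeros.append(b_0 / k)
    return zeros

if __name__ == "__main__":
    print(stopband_zeros(6, 0.5))
```

For $k = 0.5$ the chain reaches a modulus below $10^{-16}$ after five steps, which illustrates how quickly the elliptic functions degenerate into circular ones.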

The zeros of $H(s)$ have to be split into two groups,
one belonging to $H_1(s)$ and one to $H_2(s)$. The simplest
approach is to arrange that the zeros of $H_1(s)$
and $H_2(s)$ interlace, but slight digressions from this
may in some cases prove advantageous. We note that
the separate functions $H_1(s)$ and $H_2(s)$ will not have
flat passbands, although their tandem connection will.
When $n/2$, the common degree of $H_1$ and $H_2$, is odd,
both functions must have a zero at infinity. This requires
$H$ to have two zeros at infinity. But $H$ defined
by (4) has $n/2$ conjugate $j\Omega$-axis pairs of zeros and
none at infinity. As $H$ is an even function of $s$ we
correct this by making a bilinear transformation on
$\Omega^2$ ($= -s^2$) moving to infinity the conjugate pair of
zeros nearest infinity, while keeping the passband edge
frequencies $\Omega^2 = 0$ and $\Omega^2 = 1$ fixed. The transformation
is $\hat{\Omega}^2 = \Omega^2 (\Omega_m^2 - 1)/(\Omega_m^2 - \Omega^2)$ where $\hat{\Omega}$ is the
transformed frequency and $\Omega_m$ is the largest zero of
$H$ as given by (4). This will increase slightly the stopband
edge from $k^{-1}$ to $[k\,\mathrm{cd}(K/n\,;\,k)]^{-1}$. We compensate
by choosing $k$ in (4) so that $k\,\mathrm{cd}(K/n\,;\,k) = \Omega_s^{-1}$
instead of $k = \Omega_s^{-1}$. The value of $k$ can easily be
found via the iteration $k_{r+1} = k_0/\mathrm{cd}(K/n\,;\,k_r)$, starting
with $k_0 = \Omega_s^{-1}$. This transformation will reduce
slightly the 6.02 dB increase of loss we would otherwise
get; the reduction varies from 1.075 dB when $n = 10$
to 0.357 dB when $n = 30$. $H_1(z)$ and $H_2(z)$ can be
found by using $z = (1 + sT)/(1 - sT)$.

Example 

Fig. 3 shows an example from [1]. Here the result
corresponding to $|H_1(z^{-1})H_2(z)|$, for $z = e^{j\omega}$, using
our modified design, is plotted as a dashed curve, while
the corresponding plot for $|H(z)|^2$ from [1] (example 3
in Table I) is the solid curve. As expected, our design
yields an improved stopband. We have proved that
we always obtain a stopband having approximately 6
dB ($\approx 0.7$ nepers) more attenuation than we obtain
by the technique of [1] employing $|H(z)|^2$.

In Fig. 4 we examine the passband errors for the
Fig. 3 example. Notice that compatible scales have
been used in both Fig. 4(a) and Fig. 4(b); that is, we
measure the gain in nepers and the phase in radians.
(Notice also the $10^{-5}$ factor on both vertical axes in
Fig. 4.) These results, which are typical of those for
all examples considered, indicate that the non-ideal
passband errors are less pronounced in filters resulting
from our $H_1(z^{-1})H_2(z)$ design technique than they are
for those obtained through the design technique of [1].
Concluding Remark

The  increase  in  stopband  loss  of  approximately  6 
dB,  caused  by  separating  the  double  zeros,  may  be 
reminiscent  of  a  similar  feature  in  the  relationship  be¬ 
tween  certain  minimum-phase  and  linear-phase  FIR 
filters,  as  described  in  [2].  From  our  perspective  here, 
however,  this  similarity  is  superficial;  the  FIR  filter 
design  techniques  of  [2]  unfortunately  provide  no  help 
in  solving  our  IIR  filter  design  problem. 

Acknowledgment 

We would like to acknowledge the work of Dr. M.
Werter, who wrote the computer program that simulates
the filters described in this paper.


References 

[1] S. R. Powell and P. M. Chau, “A technique for
realizing linear phase IIR filters,” IEEE Transactions
on Signal Processing, vol. 39, pp. 2425-2435,
Nov. 1991.

[2] O. Herrmann and H. W. Schüssler, “Design of
nonrecursive digital filters with minimum phase,”
Electron. Lett., vol. 6, pp. 329-330, 1970.


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: FUNDAMENTAL THEORY AND APPLICATIONS, VOL. 40, NO. 7, JULY 1993


The Design of Two-Channel Lattice-Structure
Perfect-Reconstruction Filter
Banks Using Powers-of-Two Coefficients
Abstract—An optimization technique is presented for the design of
two-channel lattice-structure perfect-reconstruction filter banks with
powers-of-two coefficients. The filter coefficients are represented by a
canonic signed-digit (CSD) code. The proposed technique requires the original
optimal infinite-precision coefficients as the starting point, and searches
for the set of CSD coefficients that minimizes the peak stopband ripple.
Design examples are given to show that perfect-reconstruction filter banks
with good filtering performance can be obtained.

I.  Introduction 

Multirate analysis/synthesis filter banks find application in many
areas [1]. Recently, much attention has been given to the design
of multiplierless filter banks with applications to subband coding
[2]-[4]. In [3], three sets of short-tap filter banks with powers-of-two
coefficients were derived by judiciously factoring a seven-tap
half-band product filter. Although the perfect-reconstruction property
is preserved for these filter banks, the coding gain is poor due
to poor filtering performance [5]. In [4], a canonic signed-digit
(CSD) code search technique was used to design multiplierless
filter banks with good filtering performance at the expense of the
perfect-reconstruction property. Although this technique can achieve
negligible signal-reconstruction error in practical applications [4],
the design of perfect-reconstruction filter banks with good filtering
performance using powers-of-two coefficients is still an open problem
of great interest.

Recently, several novel lattice-structure perfect-reconstruction filter
banks have been reported [6]-[9]. One desirable feature of these
lattice-structure filter banks is that the perfect-reconstruction property
is preserved, even under the quantization of the lattice coefficients.
This feature opens the door to the design of multiplierless
perfect-reconstruction filter banks with good filtering performance since we
need only to find the set of CSD lattice coefficients yielding the
desired filtering performance. However, not all of these filter banks
are well suited for CSD design. As reported in [8], the dynamic
range of the optimal coefficients is too wide for lattice-structure
perfect-reconstruction filter banks employing linear-phase filters. A
prohibitive number of digits would be required to implement such
filter banks using fixed-point arithmetic with current technologies.
Therefore, these filters are not suitable for CSD design. The filter
banks in [6], however, do have a good, small coefficient dynamic
range, and should be good candidates for CSD design. In this paper,
we examine the use of an optimization technique, which adopts a
two-stage local search strategy over the CSD code [10], to optimize
the performance of such filter banks. Such designs should lead to
computationally efficient, multiplierless perfect-reconstruction filter
banks which should be more useful, in practice, than the original
infinite-precision design [6].

II. Optimization Algorithm

Before  formulating  the  optimization  procedure,  let  us  make  some 
observations. 

1) Although the original infinite-precision lattice filter banks have
monotonically decreasing stopband peak error [6], our computer
simulations show that after rounding the lattice coefficients
to the nearest CSD code, the peak error in the stopband
is no longer monotonically decreasing. Therefore, the original
criterion of minimizing the stopband integrated squared error
would be inappropriate here. Furthermore, the original lattice
filter banks have low passband sensitivity. Thus, a reasonable
objective function to be minimized would simply be the peak
error in the stopband of the lowpass filter:

$$\text{peak stopband error} = \max_{\omega \in \text{stopband}} |H_0(e^{j\omega})|. \qquad (1)$$

2) The impulse response coefficients of $H_0(z)$ and $H_1(z)$ are
products of the lattice coefficients; thus, the shape of the
frequency response would be affected if we scale the lattice
coefficients. In other words, large scaling on the lattice
coefficients would not improve the frequency response. Our
computer simulations show that only a little fine scaling of the
original optimal point might help. Furthermore, the quantization
error of each lattice coefficient would accumulate on its
corresponding impulse response coefficient. Therefore, we would
like to make the quantization error of each lattice coefficient
as small as possible. One possibility is to search for only the
fractional part of each lattice coefficient, which would limit
the quantization error to the fractional part of each lattice
coefficient. This, of course, fails to decrease the number of
nonzero digits for those lattice coefficients with an integer part.
However, as can be observed in [6], such coefficients tend
to be few in the original optimal infinite-precision coefficient
design, and thus the hardware penalty is minor.

Based  on  the  above  observations,  we  propose  the  following  opti¬ 
mization  procedure. 

1) The optimal infinite-precision lattice coefficients in [6] are used
as the starting point.

2) The computer program of [10] is modified to search for the
set of CSD lattice coefficients such that (1) is minimized.
Notice that we only search for the fractional part of each lattice
coefficient, and only fine scaling on the original optimal lattice
coefficients is performed.
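For readers unfamiliar with the CSD code, the sketch below shows how the fractional part of a coefficient can be written with signed digits such that no two adjacent digits are nonzero. It illustrates only the number representation, using the standard non-adjacent-form conversion of an integer; it is not the two-stage local search of [10]. The function names, the word length and the sample value are hypothetical, and the adder count assumes one adder/subtractor per nonzero digit beyond the first.

```python
def to_csd(n):
    """CSD (non-adjacent form) digits of integer n, least-significant first.

    Each digit is -1, 0 or +1 and no two adjacent digits are both nonzero.
    """
    digits = []
    while n != 0:
        if n % 2:
            d = 2 - (n % 4)      # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits

def csd_of_fraction(x, frac_bits):
    """Quantize x to frac_bits fractional bits and return its CSD digits."""
    return to_csd(round(x * 2**frac_bits))

def csd_value(digits, frac_bits=0):
    """Value represented by the digits (weights 2**i, scaled by 2**-frac_bits)."""
    return sum(d * 2**i for i, d in enumerate(digits)) / 2**frac_bits

def adders_needed(digits):
    """Adders/subtractors implied by the digits: nonzero digits minus one."""
    return max(sum(1 for d in digits if d) - 1, 0)

if __name__ == "__main__":
    digits = csd_of_fraction(0.40625, 5)     # hypothetical fractional part
    print(digits, csd_value(digits, 5), adders_needed(digits))
```

Limiting the number of nonzero digits, as in the design examples, restricts the candidate coefficient values to those whose digit lists contain at most that many nonzero entries.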

III.  Design  Examples 

Example 1: The filter bank denoted 32E in [6] was designed. The
low-pass magnitude response plots of the original 32E filter and the
CSD design are shown in Fig. 1. As reported in [6], the original
32E has monotonically decreasing stopband ripples, with the first-peak
stopband attenuation of 25 dB, whereas the CSD design has a
quasi-equal-ripple stopband with a minimum stopband attenuation of 29
dB. Two nonzero digits were chosen for the optimization. The CSD
code of the lattice coefficients is shown in Table I. It is interesting to
observe that only a single adder/subtractor, on average, is required to
implement each lattice coefficient for this example.
Fig. 1. The low-pass magnitude response plots of the original lattice-structure
filter bank 32E and the CSD design in Example 1. (Horizontal axis: frequency
in cycles/sample.)

Fig. 2. The low-pass magnitude response plots of the original lattice-structure
filter bank 48F and the CSD design in Example 2. (Horizontal axis: frequency
in cycles/sample.)

TABLE I

CSD Code of $\alpha_m$ in Example 1


Example  2:  The  filter  bank  denoted  48F  in  [6]  was  designed. 
The  low-pass  magnitude  response  plots  of  the  original  48F  filter 
and  the  CSD  design  are  shown  in  Fig.  2.  As  reported  in  [6],  the 
minimum  stopband  attenuation  of  the  original  48F  filter  is  70  dB. 
For  such  a  large  attenuation,  more  coefficient  precision  is  needed. 
Therefore,  more  nonzero  CSD  digits  are  required.  Five  nonzero  digits 
were  chosen  for  the  optimization,  and  the  CSD  code  for  the  lattice 
coefficients  is  shown  in  Table  II.  The  CSD  design  has  a  minimum 
stopband  attenuation  of  71.6  dB.  Compared  to  the  optimal  equiripple 
design  by  Smith  and  Barnwell  [6],  [11],  where  a  minimum  stopband 
attenuation  of  72  dB  was  reported,  the  CSD  design  is  only  0.4  dB 
less.  The  CSD  design  requires  3.17  adders/subtractors,  on  average, 
to  implement  each  lattice  coefficient. 

IV.  Conclusions 

We have presented an optimization technique for the design of
two-channel lattice-structure perfect-reconstruction filter banks with
CSD coefficients. The two-stage local search strategy in [10] has been
successfully modified to search for the set of CSD lattice coefficients
which minimizes the stopband peak error of the filter banks. Design
examples have been given to show the effectiveness of the proposed
algorithm.

TABLE II

CSD Code of $\alpha_m$ in Example 2
Lagrange Multiplier Approaches to the Design of
Two-Channel Perfect-Reconstruction
Linear-Phase FIR Filter Banks

Bor-Rong Horng, Member, IEEE, and Alan N. Willson, Jr., Fellow, IEEE


Abstract— Two  new  approaches  are  presented  for  the  design 
of  two-channel  perfect-reconstruction  FIR  filter  banks  employ¬ 
ing  linear-phase  filters.  We  first  formulate  the  optimization  of 
perfect-reconstruction  filter  banks  as  a  quadratic  program¬ 
ming  problem  with  linear  constraints,  and  then  as  one  with 
nonlinear  constraints.  Closed-form  solutions  for  the  first  ap¬ 
proach,  and  for  the  iteration  problem  in  the  second  approach 
are  obtained.  Design  examples  for  both  approaches  are  given. 

I.  Introduction 

Multirate filter banks are used in applications
such as speech coding, TDM-FDM transmultiplexing,
and image coding [1], [2]. In these analysis/synthesis
systems, perfect-reconstruction filter banks have been
reported recently [3]-[6]. When applied to low-rate subband
image coding, the symmetric extension method [7],
[8] has been shown to outperform the circular convolution
method [2], and to yield both objective and subjective
quality improvement at image boundaries. The symmetric
extension method requires linear-phase analysis/synthesis
filters; therefore perfect-reconstruction filter banks with
linear-phase filters are desired in subband image coding.
The design of two-channel perfect-reconstruction filter
banks employing linear-phase filters has been reported
recently [4], [9], [10], [11]. As discussed in [9], the
factorization method and the complementary filter method
might yield filters with poor quality. Novel lattice structures
are reported in both [10] and [11], and in [11] it is
reported that good-quality filters have been obtained by
optimizing the lattice parameters.

In this paper, we present two new approaches to the
design of two-channel perfect-reconstruction linear-phase
FIR filter banks. Both approaches analyze and design on
the impulse responses of the analysis filter bank directly.
The synthesis filter bank is then obtained by simply
changing the signs of odd-order coefficients in the analysis
filter bank. Our first approach deals with unequal-length
filter banks. By designing the lower length filters
first we can take advantage of the fact that the number of
variables for designing the higher length filters is more
than the number of perfect-reconstruction constraint equations.
We thus formulate the design problem as a quadratic
programming problem with linear constraints, and
we use the Lagrange multiplier method, as described in
[12] and [13], to obtain the closed-form solution for designing
the higher length filters. Our second approach
generalizes the first, and covers the design for all pairs of
linear-phase perfect-reconstruction analysis filters. It formulates
the design problem as a quadratic programming
problem with nonlinear constraints. The Lagrange-Newton
method is used to obtain the closed-form solution for
the linearized iteration problem in the second approach.
Design examples for both approaches are given.

A generic two-channel FIR filter bank is shown in Fig. 1,
where $H_0(z)$ and $H_1(z)$ represent the low-pass and
high-pass filters in the analysis bank, respectively, and
$G_0(z)$ and $G_1(z)$ are the synthesis filters. Assuming
perfect channels and codecs, it is well known that we can
relate the reconstructed signal $\hat{x}(n)$ to the input
signal $x(n)$ by

$$\hat{X}(z) = \tfrac{1}{2}[H_0(z)G_0(z) + H_1(z)G_1(z)]X(z) + \tfrac{1}{2}[H_0(-z)G_0(z) + H_1(-z)G_1(z)]X(-z).$$

Furthermore, by choosing

$$\begin{bmatrix} G_0(z) \\ G_1(z) \end{bmatrix} = \begin{bmatrix} 2H_1(-z) \\ -2H_0(-z) \end{bmatrix}$$

we have

$$\hat{X}(z) = [H_0(z)H_1(-z) - H_0(-z)H_1(z)]X(z).$$

If we impose the following pure-delay constraint

$$H_0(z)H_1(-z) - H_1(z)H_0(-z) = z^{-2k+1} \tag{1}$$

then

$$\hat{X}(z) = z^{-2k+1}X(z).$$

Thus, we obtain a perfect-reconstruction system where the
output $\hat{x}(n)$ is a delayed replica of the input $x(n)$.
Fig. 1. Two-channel analysis and synthesis filter bank.
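
As a quick numerical check of the relations above, the following sketch simulates the two-channel system in Python with NumPy. The filter pair (a 3-tap low-pass and a 5-tap high-pass filter with small rational coefficients) is an illustrative assumption and is not one of the designs in this paper; the function name `analysis_synthesis` is likewise hypothetical. Decimation followed by zero insertion is modeled by zeroing the odd-indexed samples of each subband signal.

```python
import numpy as np

def analysis_synthesis(x, h0, h1):
    """Pass x through the two-channel bank with G0(z) = 2 H1(-z), G1(z) = -2 H0(-z)."""
    h0 = np.asarray(h0, dtype=float)
    h1 = np.asarray(h1, dtype=float)
    # Replacing z by -z flips the sign of the odd-indexed coefficients.
    g0 = 2.0 * h1 * (-1.0) ** np.arange(len(h1))
    g1 = -2.0 * h0 * (-1.0) ** np.arange(len(h0))
    y = np.zeros(len(x) + len(h0) + len(h1) - 2)
    for h, g in ((h0, g0), (h1, g1)):
        v = np.convolve(x, h)      # analysis filter
        v[1::2] = 0.0              # 2:1 decimation followed by 1:2 zero insertion
        y += np.convolve(v, g)     # synthesis filter
    return y

# Illustrative type B pair (N0 = 3, N1 = 5); an assumption, not taken from the paper.
h0 = np.array([1.0, 2.0, 1.0]) / 4.0
h1 = np.array([-1.0, -2.0, 6.0, -2.0, -1.0]) / 8.0

x = np.arange(1.0, 11.0)                 # ramp input 1, 2, ..., 10
y = analysis_synthesis(x, h0, h1)
delay = (len(h0) + len(h1)) // 2 - 1     # system delay (N0 + N1)/2 - 1
print(np.allclose(y[delay:delay + len(x)], x))
```

For this pair the aliasing term cancels and the output is the input delayed by three samples, which is the behavior the pure-delay constraint is meant to guarantee.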


Combining the pure-delay constraint (1) and the linear-phase
condition, it has been shown [4], [11] that only two
types of systems yield nontrivial analysis filters:

1) both filters have even length and opposite symmetry,
denoted type A systems in [11];

2) both filters have odd length and are symmetric,
denoted type B systems in [11].

Using terminology defined in [14], for type A systems
the analysis filters are either case 2 or case 4 since the
lengths are even. It should also be noted that case 2 cannot
realize a high-pass filter. Therefore, $H_1(z)$ must be case 4
and $H_0(z)$ must be case 2.
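
The remark that a case 2 filter cannot be high-pass follows from evaluating the frequency response at $\omega = \pi$. A short derivation, assuming only the symmetry $h(n) = h(N-1-n)$ with $N$ even, pairs each coefficient with its mirror image:

$$H(e^{j\pi}) = \sum_{n=0}^{N-1} (-1)^n h(n) = \sum_{n=0}^{N/2-1} h(n)\left[(-1)^n + (-1)^{N-1-n}\right] = 0,$$

since $n$ and $N-1-n$ have opposite parity when $N$ is even. The response is therefore forced to zero at $\omega = \pi$, which rules out a high-pass characteristic.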

For type B systems it is obvious that both $H_0(z)$ and
$H_1(z)$ must be case 1.

Furthermore, by examining the pure-delay constraint in
(1), and the coefficient symmetry/antisymmetry of linear-phase
filters, we can make the following observations (assuming
the lengths of $h_0(n)$ and $h_1(n)$ to be $N_0$ and $N_1$,
respectively):

1) the sum of the lengths must be a multiple of 4 [11];

2) the number of independent constraint equations in
(1), $k$, is given by

$$k = \frac{N_0 + N_1}{4} \tag{2}$$

and $(N_0 + N_1)/2 - 1$ is the system delay;

3) the constraint equations can be expressed as

$$\delta\!\left(i - \frac{N_0 + N_1}{4}\right) = 2\sum_{k=0}^{2i-1} (-1)^k h_0(2i - 1 - k)\,h_1(k), \qquad i = 1, 2, \cdots, \frac{N_0 + N_1}{4}.$$
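
Equating the coefficients of the odd powers of $z^{-1}$ on both sides of (1) gives one condition per index $i$. The sketch below evaluates those coefficients directly from two impulse responses, so that a candidate pair can be checked against the pure-delay constraint. The function name `pr_constraints` and the two test pairs (a 2-tap type A pair and a 3-tap/5-tap type B pair) are illustrative assumptions, not designs from this paper.

```python
import numpy as np

def pr_constraints(h0, h1):
    """Coefficients of z^-(2i-1), i = 1..K, of H0(z)H1(-z) - H0(-z)H1(z), K = (N0 + N1)/4."""
    h0 = np.asarray(h0, dtype=float)
    h1 = np.asarray(h1, dtype=float)
    K = (len(h0) + len(h1)) // 4
    c = np.zeros(K)
    for i in range(1, K + 1):
        for k in range(len(h1)):
            n = 2 * i - 1 - k                      # index into h0
            if 0 <= n < len(h0):
                c[i - 1] += 2.0 * (-1.0) ** k * h0[n] * h1[k]
    return c

# Perfect reconstruction requires the vector [0, ..., 0, 1].
print(pr_constraints([0.5, 0.5], [0.5, -0.5]))                                # type A pair
print(pr_constraints([0.25, 0.5, 0.25], [-0.125, -0.25, 0.75, -0.25, -0.125]))  # type B pair
```

Only the first $(N_0 + N_1)/4$ odd-indexed coefficients need to be checked; the remaining ones follow from the symmetry of the linear-phase filters.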


It should be noted here that for type B systems $h_0(n) = 0$
for $n > N_0$ and $h_1(n) = 0$ for $n > N_1$. By adding the
coefficient symmetry/antisymmetry of the linear-phase
filters the above equations can be expressed in the matrix
form

$$C y_1 = m \tag{3}$$

where $y_1$ is an $(l_1 + 1)$-dimensional column vector

$$y_1 = [h_1(0) \;\; h_1(1) \;\; \cdots \;\; h_1(l_1)]^T,$$

$m$ is an $(N_0 + N_1)/4$-dimensional column vector

$$m = [0 \;\; \cdots \;\; 0 \;\; 1]^T,$$

$C$ is an $(N_0 + N_1)/4$-by-$(l_1 + 1)$ matrix with the elements
formed by $h_0(n)$, $n = 0, 1, \cdots, l_0$, and

$$l_0 = \left\lfloor \frac{N_0 - 1}{2} \right\rfloor, \qquad
l_1 = \begin{cases} \dfrac{N_1}{2} - 1, & \text{type A} \\[1ex] \dfrac{N_1 - 1}{2}, & \text{type B.} \end{cases}$$


The  next  section  addresses  the  formulation  and  design 
examples  of  our  first  approach,  the  Lagrange  multiplier 
method.  Section  III  addresses  our  second  approach,  the 
Lagrange-Newton  method. 


II.  Lagrange  Multiplier  Method 

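This method deals with unequal-length filter banks;
without loss of generality, we assume $N_0 < N_1$. The design
process starts with the design of $H_0(z)$, which is case
2 for type A systems, or case 1 for type B systems, using
any desired design criterion. Then, its coefficients are used
as known variables in (1), yielding a set of linear constraint
equations for designing $H_1(z)$. We recall that $H_1(z)$
is case 4 for type A systems, or case 1 for type B systems.
Therefore, by defining

$$a_1(n) = 2h_1\!\left(\frac{N_1}{2} - n\right), \qquad n = 1, 2, \cdots, \frac{N_1}{2}$$

for type A systems, and

$$b_1(0) = h_1\!\left(\frac{N_1 - 1}{2}\right), \qquad b_1(n) = 2h_1\!\left(\frac{N_1 - 1}{2} - n\right), \qquad n = 1, 2, \cdots, \frac{N_1 - 1}{2}$$

for type B systems, we can express the zero-phase frequency
response of $H_1(z)$ as a scalar product [14]

$$H_1^*(e^{j\omega}) = y_1^T s_1(\omega)$$

where

$$y_1 = \begin{cases} \left[a_1(1) \;\; a_1(2) \;\; \cdots \;\; a_1\!\left(\frac{N_1}{2}\right)\right]^T, & \text{type A} \\[1ex] \left[b_1(0) \;\; b_1(1) \;\; \cdots \;\; b_1\!\left(\frac{N_1 - 1}{2}\right)\right]^T, & \text{type B} \end{cases}$$

and

$$s_1(\omega) = \begin{cases} \left[\sin\frac{\omega}{2} \;\; \sin\frac{3\omega}{2} \;\; \cdots \;\; \sin\frac{(N_1 - 1)\omega}{2}\right]^T, & \text{type A} \\[1ex] \left[1 \;\; \cos\omega \;\; \cos 2\omega \;\; \cdots \;\; \cos\frac{(N_1 - 1)\omega}{2}\right]^T, & \text{type B.} \end{cases}$$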
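The objective function to be minimized is

$$\Phi = \frac{1}{2\pi}\left[\int_0^{\omega_{s1}} [H_1^*(e^{j\omega})]^2\, d\omega + \int_{\omega_{p1}}^{\pi} [1 - H_1^*(e^{j\omega})]^2\, d\omega\right] = \frac{1}{2} y_1^T Q y_1 + p^T y_1 + d$$

where

$$Q = \frac{1}{\pi}\left[\int_0^{\omega_{s1}} s_1(\omega)s_1^T(\omega)\, d\omega + \int_{\omega_{p1}}^{\pi} s_1(\omega)s_1^T(\omega)\, d\omega\right]$$

$$p^T = -\frac{1}{\pi}\int_{\omega_{p1}}^{\pi} s_1^T(\omega)\, d\omega$$

$$d = \frac{\pi - \omega_{p1}}{2\pi}.$$

For type A systems the elements of $Q$ are given by

$$q(i,j) = \frac{1}{\pi}\left[\int_0^{\omega_{s1}} \sin\left[\left(i - \tfrac{1}{2}\right)\omega\right]\sin\left[\left(j - \tfrac{1}{2}\right)\omega\right] d\omega + \int_{\omega_{p1}}^{\pi} \sin\left[\left(i - \tfrac{1}{2}\right)\omega\right]\sin\left[\left(j - \tfrac{1}{2}\right)\omega\right] d\omega\right], \qquad 1 \le i, j \le \frac{N_1}{2}$$

$$= \begin{cases} \dfrac{\pi + \omega_{s1} - \omega_{p1}}{2\pi} + \dfrac{\sin[(2i-1)\omega_{p1}] - \sin[(2i-1)\omega_{s1}]}{(4i-2)\pi}, & i = j \\[2ex] \dfrac{\sin[(i-j)\omega_{s1}] - \sin[(i-j)\omega_{p1}]}{2(i-j)\pi} - \dfrac{\sin[(i+j-1)\omega_{s1}] - \sin[(i+j-1)\omega_{p1}]}{2(i+j-1)\pi}, & i \ne j \end{cases}$$

and the elements of $p^T = [p_1 \;\; p_2 \;\; \cdots \;\; p_{N_1/2}]$ are given by

$$p_i = -\frac{1}{\pi}\int_{\omega_{p1}}^{\pi} \sin\left[\left(i - \tfrac{1}{2}\right)\omega\right] d\omega = -\frac{\cos\left[\left(i - \tfrac{1}{2}\right)\omega_{p1}\right]}{\left(i - \tfrac{1}{2}\right)\pi}, \qquad 1 \le i \le \frac{N_1}{2}.$$

For type B systems the elements of $Q$ are given by

$$q(i,j) = \frac{1}{\pi}\left[\int_0^{\omega_{s1}} \cos(i\omega)\cos(j\omega)\, d\omega + \int_{\omega_{p1}}^{\pi} \cos(i\omega)\cos(j\omega)\, d\omega\right], \qquad 0 \le i, j \le \frac{N_1 - 1}{2}$$

$$= \begin{cases} \dfrac{\pi + \omega_{s1} - \omega_{p1}}{\pi}, & i = j = 0 \\[2ex] \dfrac{\pi + \omega_{s1} - \omega_{p1}}{2\pi} + \dfrac{\sin(2i\omega_{s1}) - \sin(2i\omega_{p1})}{4i\pi}, & i = j \ne 0 \\[2ex] \dfrac{\sin[(i-j)\omega_{s1}] - \sin[(i-j)\omega_{p1}]}{2(i-j)\pi} + \dfrac{\sin[(i+j)\omega_{s1}] - \sin[(i+j)\omega_{p1}]}{2(i+j)\pi}, & i \ne j \end{cases}$$

and the elements of $p^T = [p_0 \;\; p_1 \;\; \cdots \;\; p_{(N_1-1)/2}]$ are
given by

$$p_i = -\frac{1}{\pi}\int_{\omega_{p1}}^{\pi} \cos(i\omega)\, d\omega = \begin{cases} -\dfrac{\pi - \omega_{p1}}{\pi}, & i = 0 \\[2ex] \dfrac{\sin(i\omega_{p1})}{i\pi}, & i \ne 0, \end{cases} \qquad 0 \le i \le \frac{N_1 - 1}{2}.$$

It is easy to reformulate the set of linear constraint
equations in terms of $y_1$, yielding the following form:

$$C y_1 = m \tag{4}$$

where $C$ is a known matrix, for a given $H_0(z)$. Therefore,
our optimization problem for designing $H_1(z)$ becomes

$$\min\; \Phi = \tfrac{1}{2} y_1^T Q y_1 + p^T y_1 + d \quad \text{subject to} \quad C y_1 = m.$$

Following the technique developed in [12], [13], we
can solve this design problem in closed form by the
method of Lagrange multipliers. The Lagrange multiplier
vector is

$$\lambda = [\lambda_1 \;\; \lambda_2 \;\; \cdots \;\; \lambda_k]^T$$

and the Lagrangian function is

$$\Lambda(y_1, \lambda) = \tfrac{1}{2} y_1^T Q y_1 + p^T y_1 + d - \lambda^T (C y_1 - m).$$

Imposing the necessary and sufficient conditions for the
solution

$$\nabla_{y_1} \Lambda = 0, \qquad \nabla_{\lambda} \Lambda = 0$$

we arrive at the following system of linear equations for
the filter coefficient vector and Lagrange multiplier vector:

$$\begin{bmatrix} Q & -C^T \\ -C & 0 \end{bmatrix} \begin{bmatrix} y_1 \\ \lambda \end{bmatrix} = \begin{bmatrix} -p \\ -m \end{bmatrix}. \tag{5}$$

The resulting closed-form solutions are

$$y_1 = Q^{-1}C^T(CQ^{-1}C^T)^{-1}m + Q^{-1}\left[C^T(CQ^{-1}C^T)^{-1}CQ^{-1} - I\right]p \tag{6}$$

$$\lambda = (CQ^{-1}C^T)^{-1}(m + CQ^{-1}p). \tag{7}$$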
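
To make the closed-form design concrete, the sketch below carries out a very small type B design ($N_0 = 3$, $N_1 = 5$) in Python with NumPy. It is a minimal illustration under stated assumptions: the 3-tap low-pass filter is chosen arbitrarily for the demonstration, all function names are hypothetical, the constraint matrix is built from the odd-power coefficients of (1), and linear systems are solved with `numpy.linalg.solve` instead of forming the inverses that appear in (6) and (7). The band edges $0.4\pi$ and $0.6\pi$ are the ones used for $H_1(z)$ in the examples.

```python
import numpy as np

def qp_terms_type_b(n1, ws, wp):
    """Q and p for a type B high-pass H1 (stopband [0, ws], passband [wp, pi])."""
    idx = np.arange((n1 - 1) // 2 + 1)

    def band(m):
        # (1 / (2 pi)) * integral of cos(m w) over [0, ws] and [wp, pi]
        m = np.abs(m)
        safe = np.where(m == 0, 1, m)
        val = (np.sin(safe * ws) - np.sin(safe * wp)) / (2.0 * np.pi * safe)
        return np.where(m == 0, (np.pi + ws - wp) / (2.0 * np.pi), val)

    i, j = idx[:, None], idx[None, :]
    Q = band(i - j) + band(i + j)          # cos(iw)cos(jw) = [cos((i-j)w) + cos((i+j)w)] / 2
    p = np.where(idx == 0, -(np.pi - wp) / np.pi,
                 np.sin(idx * wp) / (np.pi * np.maximum(idx, 1)))
    return Q, p

def constraint_matrix_type_b(h0, n1):
    """C, m with C @ y1 = m expressing (1); y1 = [b1(0), ..., b1((n1-1)/2)]."""
    n0, c = len(h0), (n1 - 1) // 2
    K = (n0 + n1) // 4
    Ch = np.zeros((K, n1))                 # constraints on the full impulse response h1
    for i in range(1, K + 1):
        for k in range(n1):
            n = 2 * i - 1 - k
            if 0 <= n < n0:
                Ch[i - 1, k] = 2.0 * (-1.0) ** k * h0[n]
    S = np.zeros((n1, c + 1))              # h1 = S @ y1
    S[c, 0] = 1.0                          # b1(0) = h1(c)
    for n in range(1, c + 1):
        S[c - n, n] = S[c + n, n] = 0.5    # b1(n) = 2 h1(c - n)
    m = np.zeros(K)
    m[-1] = 1.0
    return Ch @ S, m, S

def lagrange_solve(Q, p, C, m):
    """Minimizer of 0.5 y'Qy + p'y subject to C y = m, following (6) and (7)."""
    Qi_Ct = np.linalg.solve(Q, C.T)
    Qi_p = np.linalg.solve(Q, p)
    lam = np.linalg.solve(C @ Qi_Ct, m + C @ Qi_p)   # (7)
    y = Qi_Ct @ lam - Qi_p                           # same vector as (6)
    return y, lam

h0 = np.array([1.0, 2.0, 1.0]) / 4.0                 # illustrative low-pass filter (assumption)
Q, p = qp_terms_type_b(5, 0.4 * np.pi, 0.6 * np.pi)
C, m, S = constraint_matrix_type_b(h0, 5)
y1, lam = lagrange_solve(Q, p, C, m)
h1 = S @ y1                                          # 5-tap symmetric high-pass filter
print(h1)
```

With two constraints and three free coefficients, this toy design has a single degree of freedom, which mirrors the situation described for Example 2.1 on a much smaller scale.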


Example 2.1: A type B system with $N_0 = 23$ and $N_1 = 25$
was designed. We first designed the 23-tap low-pass
filter $H_0(z)$ with passband edge frequency $\omega_{p0} = 0.5\pi$ and
stopband edge frequency $\omega_{s0} = 0.6\pi$, using the eigenfilter
approach [15]. We chose this approach so that $H_0(z)$ and
$H_1(z)$ would both be designed according to a least squares
criterion. Then, the coefficients of $H_0(z)$ were used as
known variables to obtain the $C$ matrix in (4). The $Q$ matrix
and $p$ vector were easily calculated, given the band-edge
frequencies $\omega_{s1}$ and $\omega_{p1}$, which were chosen to be
$0.4\pi$ and $0.6\pi$. The coefficients of $H_1(z)$ were then
obtained by the simple matrix computations in (6). The
coefficients of $H_0(z)$ and $H_1(z)$ are shown in Table I. The
magnitude response plots of $H_0(z)$ and $H_1(z)$ are shown
in Fig. 2. The choice of the low-pass filter will affect the
filtering performance of the resulting high-pass filter. Our
computer simulations showed that by choosing a narrower
transition bandwidth for the low-pass filter we can always
obtain the high-pass filter with good frequency response.
In fact, since the design of the low-pass filter involves
only the computation of the eigenvectors for a matrix
determined by $\omega_{s0}$ and $\omega_{p0}$, and the design of the high-pass
filter involves only the simple matrix computation in (6)
determined by $\omega_{s1}$ and $\omega_{p1}$, a computer program has been
written in which we only need to adjust the values of $\omega_{s0}$,
$\omega_{p0}$, $\omega_{s1}$, and $\omega_{p1}$ for finding the appropriate filters. Our
experiments showed that with only a few tries we can easily
obtain the desired filters. Notice that we have 12 design
parameters for designing the low-pass filter $H_0(z)$ and
13 design parameters for designing the high-pass filter
$H_1(z)$. The number of perfect-reconstruction constraints,
according to (2), is 12. This implies that by designing the
low-pass filter first, we have only one degree of freedom
left for designing the high-pass filter. However, with the
help of the Lagrange-multiplier method we show here that
even with only one degree of freedom we have been able
to design good filters.

Example 2.2: A type B system with a larger difference
in filter lengths, $N_0 = 15$ and $N_1 = 25$, was designed.


TABLE I

Impulse Responses of the Optimized Analysis Filters in Example 2.1

 n    h0(n)                     h1(n)
 0     0.32022072941022D-02      0.20757249335743D-03
 1    -0.16270808813591D-01     -0.10547013494721D-02
 2     0.26026195430109D-02      0.16928183546876D-02
 3     0.26218446566847D-01     -0.60446771109714D-02
 4    -0.14024093618472D-01     -0.97791939000393D-03
 5    -0.35139065156312D-01      0.16845089340686D-01
 6     0.38375232701711D-01     -0.30423069328650D-02
 7     0.44088905465945D-01     -0.35790690426760D-01
 8    -0.88969771578456D-01      0.11756760495661D-01
 9    -0.49113286303353D-01      0.92205949206770D-01
10     0.31303768685160D-00     -0.11746190124199D-01
11     0.55198385409394D-00     -0.31271750139945D-00
12                               0.51134218298697D-00

The number of constraints for this system is 10. We have
13 design parameters, and thus 3 degrees of freedom, for
designing the high-pass filter. The appropriate low-pass
filter is chosen in the same way as in example 2.1. The
low-pass eigenfilter $H_0(z)$ with $\omega_{p0} = 0.46\pi$ and
$\omega_{s0} = 0.6\pi$ was first designed. The high-pass filter $H_1(z)$
with $\omega_{s1} = 0.4\pi$ and $\omega_{p1} = 0.6\pi$ was then obtained using
(6). The coefficients of $H_0(z)$ and $H_1(z)$ are shown in
Table II. The magnitude response plots of $H_0(z)$ and $H_1(z)$
are shown in Fig. 3.

Example 2.3: In this example we designed a type A
system with $N_0 = 16$ and $N_1 = 28$. The number of
constraints is 12. Here we have 14 design parameters, and
thus 2 degrees of freedom, for designing the high-pass
filter. The low-pass eigenfilter $H_0(z)$ with $\omega_{p0} = 0.44\pi$
and $\omega_{s0} = 0.6\pi$ was first designed. The high-pass filter
$H_1(z)$ with $\omega_{s1} = 0.4\pi$ and $\omega_{p1} = 0.6\pi$ was then obtained.
The coefficients of $H_0(z)$ and $H_1(z)$ are shown in Table
III. The magnitude response plots of $H_0(z)$ and $H_1(z)$ are
shown in Fig. 4.

In order to demonstrate the perfect-reconstruction property
of the proposed Lagrange multiplier method, a computer
simulation with double-precision arithmetic was run
for a simple ramp input sequence $x(n)$. The reconstructed
TABLE II

Impulse Responses of the Optimized Analysis Filters in Example 2.2

 n    h0(n)                     h1(n)
 0     0.19042472255677D-01     -0.18546886655979D-02
 1     0.13764493664993D-01     -0.13406268915815D-02
 2    -0.45029459441763D-01      0.73091436972806D-02
 3    -0.22388494430843D-01      0.42936990344331D-02
 4     0.91404228166495D-01     -0.26647270668886D-01
 5     0.23010820706660D-01     -0.13507852518241D-01
 6    -0.31583265276186D-00      0.24659762972978D-01
 7    -0.52794281631871D-00      0.34620141498229D-01
 8                              -0.27144262652324D-01
 9                              -0.90014975943879D-01
10                               0.33679245674349D-01
11                               0.31042933163383D-00
12                              -0.53108117920985D-00

Fig. 3. Magnitude response plots for the analysis filters in example 2.2.


TABLE III

Impulse Responses of the Optimized Analysis Filters in Example 2.3

 n    h0(n)                     h1(n)
 0    -0.50524665078789D-02     -0.19222322824304D-03
 1    -0.22079439119735D-01     -0.84002161296320D-03
 2     0.17886323630082D-01      0.12140454726470D-02
 3     0.46720025466817D-01      0.41091254662707D-02
 4    -0.41230804736905D-01     -0.56591175708268D-02
 5    -0.92622557948810D-01     -0.18078827282842D-01
 6     0.13353939880734D-00      0.13995252437161D-01
 7     0.46283952040909D-00      0.33618364075630D-01
 8                              -0.34743037073021D-02
 9                              -0.56273810205632D-01
10                              -0.24049238631185D-01
11                               0.10878011285809D-00
12                               0.11278611773976D-00
13                              -0.47669441078655D-00

signals $\hat{x}(n)$ for examples 2.1, 2.2, and 2.3 are shown in
Table IV.


III.  Lagrange-Newton  Method 

While our first approach is simple and easy to use, it
would probably be better for most situations to avoid the
arbitrary choice of $H_0(z)$. Furthermore, we cannot use this
approach for the design of equal-length filter banks, as the
degree of freedom for designing $H_1(z)$ reduces to zero in
this case. Therefore, we need a systematic approach that
finds the appropriate $H_0(z)$ and $H_1(z)$ simultaneously, in
some optimal sense, and that also handles the equal-length
case.

Fig. 4. Magnitude response plots for the analysis filters in example 2.3.

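Our second approach, which we call the Lagrange-Newton
method, meets all these requirements. Here, the
impulse responses of $H_0(z)$ and $H_1(z)$ are treated
simultaneously as unknowns. This makes (3) a set of nonlinear
perfect-reconstruction constraint equations. Defining

$$a_0(n) = 2h_0\!\left(\frac{N_0}{2} - n\right), \qquad n = 1, 2, \cdots, \frac{N_0}{2}$$

and

$$b_0(0) = h_0\!\left(\frac{N_0 - 1}{2}\right), \qquad b_0(n) = 2h_0\!\left(\frac{N_0 - 1}{2} - n\right), \qquad n = 1, 2, \cdots, \frac{N_0 - 1}{2}$$

we can express the zero-phase frequency response of $H_0(z)$
as a scalar product

$$H_0^*(e^{j\omega}) = y_0^T s_0(\omega)$$

where

$$y_0 = \begin{cases} \left[a_0(1) \;\; a_0(2) \;\; \cdots \;\; a_0\!\left(\frac{N_0}{2}\right)\right]^T, & \text{type A} \\[1ex] \left[b_0(0) \;\; b_0(1) \;\; \cdots \;\; b_0\!\left(\frac{N_0 - 1}{2}\right)\right]^T, & \text{type B} \end{cases}$$

and

$$s_0(\omega) = \begin{cases} \left[\cos\frac{\omega}{2} \;\; \cos\frac{3\omega}{2} \;\; \cdots \;\; \cos\frac{(N_0 - 1)\omega}{2}\right]^T, & \text{type A} \\[1ex] \left[1 \;\; \cos\omega \;\; \cos 2\omega \;\; \cdots \;\; \cos\frac{(N_0 - 1)\omega}{2}\right]^T, & \text{type B.} \end{cases}$$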
TABLE IV

A Ramp Input Sequence $x(n)$ and the Reconstructed Signal $\hat{x}(n)$ for Example 2.1, Example 2.2, and Example 2.3

                            Example 2.1          Example 2.2          Example 2.3
 n    x(n)                  x^(n + 23)           x^(n + 19)           x^(n + 21)
 0     1.0000000000000       1.0000000000000      1.0000000000000      1.0000000000000
 1     2.0000000000000       2.0000000000000      2.0000000000000      2.0000000000000
 2     3.0000000000000       3.0000000000000      3.0000000000000      3.0000000000000
 3     4.0000000000000       4.0000000000000      4.0000000000000      4.0000000000000
 4     5.0000000000000       5.0000000000000      5.0000000000000      5.0000000000000
 5     6.0000000000000       6.0000000000000      6.0000000000000      6.0000000000000
 6     7.0000000000000       7.0000000000000      7.0000000000000      7.0000000000000
 7     8.0000000000000       8.0000000000000      8.0000000000000      8.0000000000000
 8     9.0000000000000       9.0000000000000      9.0000000000000      9.0000000000000
 9    10.0000000000000      10.0000000000000     10.0000000000000     10.0000000000000

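The zero-phase frequency response of $H_1(z)$ can be expressed
in the same form as that in Section II. There, $H_0(z)$
and $H_1(z)$ are designed separately, which does not guarantee
that the joint square error is minimal. Here we propose an
approach which will minimize the following joint
weighted square error:

$$\Phi = \frac{\alpha_0}{2\pi}\left[\int_0^{\omega_{p0}} [1 - H_0^*(e^{j\omega})]^2\, d\omega + \alpha_{s0}\int_{\omega_{s0}}^{\pi} [H_0^*(e^{j\omega})]^2\, d\omega\right] + \frac{\alpha_1}{2\pi}\left[\int_{\omega_{p1}}^{\pi} [1 - H_1^*(e^{j\omega})]^2\, d\omega + \alpha_{s1}\int_0^{\omega_{s1}} [H_1^*(e^{j\omega})]^2\, d\omega\right]$$

where $\alpha_{s0}$ and $\alpha_{s1}$ are the stopband weighting factors for
$H_0(z)$ and $H_1(z)$, respectively, and $\alpha_0$ and $\alpha_1$ are the
weighting factors for the whole approximation errors of
$H_0(z)$ and $H_1(z)$, respectively. This objective function can
be expressed as

$$\Phi = \tfrac{1}{2} y_0^T Q_0 y_0 + p_0^T y_0 + d_0 + \tfrac{1}{2} y_1^T Q_1 y_1 + p_1^T y_1 + d_1$$

where

$$Q_0 = \frac{\alpha_0}{\pi}\int_0^{\omega_{p0}} s_0(\omega)s_0^T(\omega)\, d\omega + \frac{\alpha_0\alpha_{s0}}{\pi}\int_{\omega_{s0}}^{\pi} s_0(\omega)s_0^T(\omega)\, d\omega$$

$$p_0^T = -\frac{\alpha_0}{\pi}\int_0^{\omega_{p0}} s_0^T(\omega)\, d\omega, \qquad d_0 = \frac{\alpha_0\omega_{p0}}{2\pi}$$

$$Q_1 = \frac{\alpha_1}{\pi}\int_{\omega_{p1}}^{\pi} s_1(\omega)s_1^T(\omega)\, d\omega + \frac{\alpha_1\alpha_{s1}}{\pi}\int_0^{\omega_{s1}} s_1(\omega)s_1^T(\omega)\, d\omega$$

$$p_1^T = -\frac{\alpha_1}{\pi}\int_{\omega_{p1}}^{\pi} s_1^T(\omega)\, d\omega, \qquad d_1 = \frac{\alpha_1(\pi - \omega_{p1})}{2\pi}.$$

For type A systems, the elements of $Q_0$, $p_0$, $Q_1$, and $p_1$
are given as follows:

$$q_0(i,j) = \frac{\alpha_0}{\pi}\left[\int_0^{\omega_{p0}} \cos\left[\left(i - \tfrac{1}{2}\right)\omega\right]\cos\left[\left(j - \tfrac{1}{2}\right)\omega\right] d\omega + \alpha_{s0}\int_{\omega_{s0}}^{\pi} \cos\left[\left(i - \tfrac{1}{2}\right)\omega\right]\cos\left[\left(j - \tfrac{1}{2}\right)\omega\right] d\omega\right], \qquad 1 \le i, j \le \frac{N_0}{2}$$

$$= \begin{cases} \dfrac{\alpha_0}{\pi}\left[\dfrac{\omega_{p0} + \alpha_{s0}(\pi - \omega_{s0})}{2} + \dfrac{\sin[(2i-1)\omega_{p0}] - \alpha_{s0}\sin[(2i-1)\omega_{s0}]}{4i - 2}\right], & i = j \\[2ex] \dfrac{\alpha_0}{\pi}\left[\dfrac{\sin[(i-j)\omega_{p0}] - \alpha_{s0}\sin[(i-j)\omega_{s0}]}{2(i-j)} + \dfrac{\sin[(i+j-1)\omega_{p0}] - \alpha_{s0}\sin[(i+j-1)\omega_{s0}]}{2(i+j-1)}\right], & i \ne j \end{cases}$$

$$p_{0,i} = -\frac{\alpha_0}{\pi}\int_0^{\omega_{p0}} \cos\left[\left(i - \tfrac{1}{2}\right)\omega\right] d\omega = -\frac{\alpha_0}{\pi}\,\frac{\sin\left[\left(i - \tfrac{1}{2}\right)\omega_{p0}\right]}{i - \tfrac{1}{2}}, \qquad 1 \le i \le \frac{N_0}{2}$$

$$q_1(i,j) = \frac{\alpha_1}{\pi}\left[\alpha_{s1}\int_0^{\omega_{s1}} \sin\left[\left(i - \tfrac{1}{2}\right)\omega\right]\sin\left[\left(j - \tfrac{1}{2}\right)\omega\right] d\omega + \int_{\omega_{p1}}^{\pi} \sin\left[\left(i - \tfrac{1}{2}\right)\omega\right]\sin\left[\left(j - \tfrac{1}{2}\right)\omega\right] d\omega\right], \qquad 1 \le i, j \le \frac{N_1}{2}$$

$$= \begin{cases} \dfrac{\alpha_1}{\pi}\left[\dfrac{\alpha_{s1}\omega_{s1} + \pi - \omega_{p1}}{2} + \dfrac{\sin[(2i-1)\omega_{p1}] - \alpha_{s1}\sin[(2i-1)\omega_{s1}]}{4i - 2}\right], & i = j \\[2ex] \dfrac{\alpha_1}{\pi}\left[\dfrac{\alpha_{s1}\sin[(i-j)\omega_{s1}] - \sin[(i-j)\omega_{p1}]}{2(i-j)} - \dfrac{\alpha_{s1}\sin[(i+j-1)\omega_{s1}] - \sin[(i+j-1)\omega_{p1}]}{2(i+j-1)}\right], & i \ne j \end{cases}$$

$$p_{1,i} = -\frac{\alpha_1}{\pi}\int_{\omega_{p1}}^{\pi} \sin\left[\left(i - \tfrac{1}{2}\right)\omega\right] d\omega = -\frac{\alpha_1}{\pi}\,\frac{\cos\left[\left(i - \tfrac{1}{2}\right)\omega_{p1}\right]}{i - \tfrac{1}{2}}, \qquad 1 \le i \le \frac{N_1}{2}.$$

For type B systems, the elements of $Q_0$, $p_0$, $Q_1$, and $p_1$ are

$$q_0(i,j) = \frac{\alpha_0}{\pi}\left[\int_0^{\omega_{p0}} \cos(i\omega)\cos(j\omega)\, d\omega + \alpha_{s0}\int_{\omega_{s0}}^{\pi} \cos(i\omega)\cos(j\omega)\, d\omega\right], \qquad 0 \le i, j \le \frac{N_0 - 1}{2}$$

$$= \begin{cases} \dfrac{\alpha_0(\alpha_{s0}\pi + \omega_{p0} - \alpha_{s0}\omega_{s0})}{\pi}, & i = j = 0 \\[2ex] \dfrac{\alpha_0}{\pi}\left[\dfrac{\omega_{p0} + \alpha_{s0}(\pi - \omega_{s0})}{2} + \dfrac{\sin(2i\omega_{p0}) - \alpha_{s0}\sin(2i\omega_{s0})}{4i}\right], & i = j \ne 0 \\[2ex] \dfrac{\alpha_0}{\pi}\left[\dfrac{\sin[(i-j)\omega_{p0}] - \alpha_{s0}\sin[(i-j)\omega_{s0}]}{2(i-j)} + \dfrac{\sin[(i+j)\omega_{p0}] - \alpha_{s0}\sin[(i+j)\omega_{s0}]}{2(i+j)}\right], & i \ne j \end{cases}$$

$$p_{0,i} = -\frac{\alpha_0}{\pi}\int_0^{\omega_{p0}} \cos(i\omega)\, d\omega = \begin{cases} -\dfrac{\alpha_0\omega_{p0}}{\pi}, & i = 0 \\[2ex] -\dfrac{\alpha_0\sin(i\omega_{p0})}{i\pi}, & i \ne 0, \end{cases} \qquad 0 \le i \le \frac{N_0 - 1}{2}$$

$$q_1(i,j) = \frac{\alpha_1}{\pi}\left[\alpha_{s1}\int_0^{\omega_{s1}} \cos(i\omega)\cos(j\omega)\, d\omega + \int_{\omega_{p1}}^{\pi} \cos(i\omega)\cos(j\omega)\, d\omega\right], \qquad 0 \le i, j \le \frac{N_1 - 1}{2}$$

$$= \begin{cases} \dfrac{\alpha_1(\alpha_{s1}\omega_{s1} + \pi - \omega_{p1})}{\pi}, & i = j = 0 \\[2ex] \dfrac{\alpha_1}{\pi}\left[\dfrac{\alpha_{s1}\omega_{s1} + \pi - \omega_{p1}}{2} + \dfrac{\alpha_{s1}\sin(2i\omega_{s1}) - \sin(2i\omega_{p1})}{4i}\right], & i = j \ne 0 \\[2ex] \dfrac{\alpha_1}{\pi}\left[\dfrac{\alpha_{s1}\sin[(i-j)\omega_{s1}] - \sin[(i-j)\omega_{p1}]}{2(i-j)} + \dfrac{\alpha_{s1}\sin[(i+j)\omega_{s1}] - \sin[(i+j)\omega_{p1}]}{2(i+j)}\right], & i \ne j \end{cases}$$

$$p_{1,i} = -\frac{\alpha_1}{\pi}\int_{\omega_{p1}}^{\pi} \cos(i\omega)\, d\omega = \begin{cases} -\dfrac{\alpha_1(\pi - \omega_{p1})}{\pi}, & i = 0 \\[2ex] \dfrac{\alpha_1\sin(i\omega_{p1})}{i\pi}, & i \ne 0, \end{cases} \qquad 0 \le i \le \frac{N_1 - 1}{2}.$$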
Next, we can easily convert the set of nonlinear constraint
equations (3) into the following form:

$$u_l(y_0, y_1) = y_0^T D_l y_1 = 0, \qquad l = 1, 2, \cdots, k - 1$$

$$u_l(y_0, y_1) = y_0^T D_l y_1 - 1 = 0, \qquad l = k$$

where $D_l$ can easily be determined when the lengths of the
filters are given. Defining

$$y = [y_0^T \;\; y_1^T]^T$$

and

$$u(y) = [u_1(y) \;\; u_2(y) \;\; \cdots \;\; u_k(y)]^T$$

we can formulate our optimization problem as

$$\min\; \Phi(y) \quad \text{subject to} \quad u(y) = 0 \tag{8}$$

which is a nonlinear programming problem with nonlinear
constraints.

This problem can be solved iteratively using the
Lagrange-Newton method [16]. The Lagrange function is

$$L(y, \lambda) = \Phi(y) - \lambda^T u$$

where

$$\lambda = [\lambda_1 \;\; \lambda_2 \;\; \cdots \;\; \lambda_k]^T.$$

We define

$$\nabla = \begin{bmatrix} \nabla_y \\ \nabla_\lambda \end{bmatrix}.$$

Then, the condition for the stationary point $y^*$, $\lambda^*$ is

$$\nabla L(y^*, \lambda^*) = 0.$$

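Expanding $\nabla L$ in a Taylor series about $y^{(i)}$, $\lambda^{(i)}$ yields

$$\nabla L(y^{(i)} + \delta y, \lambda^{(i)} + \delta\lambda) = \nabla L(y^{(i)}, \lambda^{(i)}) + [\nabla^2 L(y^{(i)}, \lambda^{(i)})]\begin{bmatrix} \delta y \\ \delta\lambda \end{bmatrix} + \cdots.$$

Neglecting higher order terms and setting the left-hand
side to zero gives the iteration

$$[\nabla^2 L(y^{(i)}, \lambda^{(i)})]\begin{bmatrix} \delta y \\ \delta\lambda \end{bmatrix} = -\nabla L(y^{(i)}, \lambda^{(i)}).$$

This is solved to give corrections $\delta y$ and $\delta\lambda$. Defining
$\lambda^{(i+1)} = \lambda^{(i)} + \delta\lambda$ we then obtain the following system:

$$\begin{bmatrix} G^{(i)} & -A^{(i)} \\ -A^{(i)T} & 0 \end{bmatrix}\begin{bmatrix} \delta y \\ \lambda^{(i+1)} \end{bmatrix} = \begin{bmatrix} -g^{(i)} \\ u^{(i)} \end{bmatrix} \tag{9}$$

where

$$G^{(i)} = \nabla_y^2 L^{(i)}$$

$$A^{(i)} = [\nabla u_1^{(i)} \;\; \nabla u_2^{(i)} \;\; \cdots \;\; \nabla u_k^{(i)}]$$

$$g^{(i)} = \nabla\Phi^{(i)}$$

and where the superscript $(i)$ indicates that the values are
evaluated at the $i$th iterate $y^{(i)}$ and $\lambda^{(i)}$. Notice that the
linear system of (9) has the same form as (5) and can
readily be solved, giving

$$\delta y = -G^{-1}A(A^T G^{-1} A)^{-1}u + G^{-1}\left[A(A^T G^{-1} A)^{-1}A^T G^{-1} - I\right]g \tag{10}$$

$$\lambda^{(i+1)} = (A^T G^{-1} A)^{-1}(A^T G^{-1} g - u). \tag{11}$$

The analytical forms for $A$, $G$, and $g$ can easily be derived
as follows:

$$A = \begin{bmatrix}
D_1(1r)y_1 & D_2(1r)y_1 & \cdots & D_k(1r)y_1 \\
D_1(2r)y_1 & D_2(2r)y_1 & \cdots & D_k(2r)y_1 \\
\vdots & \vdots & & \vdots \\
D_1(Nr)y_1 & D_2(Nr)y_1 & \cdots & D_k(Nr)y_1 \\
y_0^T D_1(1c) & y_0^T D_2(1c) & \cdots & y_0^T D_k(1c) \\
y_0^T D_1(2c) & y_0^T D_2(2c) & \cdots & y_0^T D_k(2c) \\
\vdots & \vdots & & \vdots \\
y_0^T D_1(Mc) & y_0^T D_2(Mc) & \cdots & y_0^T D_k(Mc)
\end{bmatrix}$$

where $D_k(Nr)$ and $D_k(Mc)$ represent the $N$th row and the
$M$th column of $D_k$, respectively, and

$$M = \begin{cases} \dfrac{N_1}{2}, & \text{for type A} \\[1ex] \dfrac{N_1 + 1}{2}, & \text{for type B} \end{cases}
\qquad \text{and} \qquad
N = \begin{cases} \dfrac{N_0}{2}, & \text{for type A} \\[1ex] \dfrac{N_0 + 1}{2}, & \text{for type B} \end{cases}$$

$$G = \begin{bmatrix} Q_0 & -\sum_{i=1}^{k} \lambda_i D_i \\ -\sum_{i=1}^{k} \lambda_i D_i^T & Q_1 \end{bmatrix}$$

$$g = \begin{bmatrix} Q_0 y_0 + p_0 \\ Q_1 y_1 + p_1 \end{bmatrix}.$$

Therefore, we simply form $A$, $G$, $g$, and $u$, and use (10)
and (11) to find $\delta y$ and $\lambda^{(i+1)}$. Then, $y^{(i+1)}$ is given by

$$y^{(i+1)} = y^{(i)} + \delta y. \tag{12}$$

Example 3.1: A type B system with $N_0 = 23$, $N_1 = 25$,
$\omega_{p0} = \omega_{s1} = 0.4\pi$, and $\omega_{p1} = \omega_{s0} = 0.6\pi$ was designed.
Our Lagrange-Newton method requires initial approximations
$y^{(1)}$ and $\lambda^{(1)}$, and uses (10)-(12) to generate
the iterative sequence $\{y^{(i)}, \lambda^{(i)}\}$. As with most nonlinear
optimization problems, our computer simulations
showed that the solution was sensitive to the initial
approximations. However, for unequal-length filter banks,
our first method, the Lagrange multiplier method, served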
Fig. 5. Magnitude response plots of the proposed approach and the lattice
approach in example 3.1.


TABLE V

Impulse Responses of the Optimized Analysis Filters in Example 3.1

 n    h0(n)                     h1(n)
 0     0.19664310885798D-02      0.26030504425555D-03
 1    -0.15071897603198D-01     -0.19951327027936D-02
 2     0.39538725460162D-02      0.14764582380207D-02
 3     0.24878605633241D-01     -0.40115824373647D-02
 4    -0.14200153852088D-01     -0.89182016597635D-03
 5    -0.36006015487364D-01      0.14407416308426D-01
 6     0.35461028533182D-01     -0.23455316659266D-02
 7     0.47745012128669D-01     -0.35978610240721D-01
 8    -0.89925971104091D-01      0.10662107010783D-01
 9    -0.53096857923338D-01      0.91238511647933D-01
10     0.31042847037014D-00     -0.87734159524355D-02
11     0.55179158942662D-00     -0.31495941897510D-00
12                               0.51290303596677D-00

as an easy way to approximate the initial estimates. We
simply used the results of example 2.1 as the initial
approximations: 1) We designed a 23-tap low-pass eigenfilter
$H_0(z)$ with $\omega_{p0} = 0.5\pi$ and $\omega_{s0} = 0.6\pi$ to get $y_0^{(1)}$. 2)
We used (6) and (7) to find $y_1^{(1)}$ and $\lambda^{(1)}$, respectively.

Then, $y^{(1)} = [y_0^{(1)T} \;\; y_1^{(1)T}]^T$ and $\lambda^{(1)}$ were used
iteratively to find the optimal solution. With
$\alpha_0 = \alpha_1 = \alpha_{s0} = \alpha_{s1} = 1$, our algorithm converged to the
solution within 11 iterations. The magnitude response plots
of $H_0(z)$ and $H_1(z)$ are shown in Fig. 5. The coefficients of
$H_0(z)$ and $H_1(z)$ are shown in Table V. To compare with other
results reported recently, the magnitude response plots of
the lattice approach of [11] are also shown in Fig. 5, and
the peak ripples in the passband and stopband are
summarized in Table VI. It is evident that the proposed
approach has smaller peak ripples.
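
The Lagrange-Newton iteration used in this example can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions rather than the program used for the paper: the function name `lagrange_newton` is hypothetical, the constraints are taken in the bilinear form $u_l = y_0^T D_l y_1 - c_l$ with the constants $c_l$ passed in explicitly, and each step solves the block system (9) directly with `numpy.linalg.solve` instead of evaluating (10) and (11). The scalar toy problem at the end is invented purely to exercise the code.

```python
import numpy as np

def lagrange_newton(Q0, p0, Q1, p1, D, c, y0, y1, lam, iters=25):
    """Minimize 0.5 y0'Q0 y0 + p0'y0 + 0.5 y1'Q1 y1 + p1'y1
    subject to u_l = y0' D[l] y1 - c[l] = 0, by iterating (9) and (12)."""
    n0, k = len(y0), len(c)
    for _ in range(iters):
        u = np.array([y0 @ Dl @ y1 for Dl in D]) - c
        g = np.concatenate((Q0 @ y0 + p0, Q1 @ y1 + p1))      # gradient of the objective
        A = np.column_stack([np.concatenate((Dl @ y1, Dl.T @ y0)) for Dl in D])
        B = sum(l * Dl for l, Dl in zip(lam, D))
        G = np.block([[Q0, -B], [-B.T, Q1]])                  # Hessian of the Lagrangian in y
        KKT = np.block([[G, -A], [-A.T, np.zeros((k, k))]])
        sol = np.linalg.solve(KKT, np.concatenate((-g, u)))   # system (9)
        y0, y1 = y0 + sol[:n0], y1 + sol[n0:-k]               # update (12)
        lam = sol[-k:]                                        # new multiplier estimate
    return y0, y1, lam

# Toy problem (hypothetical): scalar y0, y1 with the single constraint y0 * y1 = 1.
Q0, p0 = np.array([[1.0]]), np.array([-2.0])
Q1, p1 = np.array([[1.0]]), np.array([-1.0])
D, c = [np.array([[1.0]])], np.array([1.0])
y0, y1, lam = lagrange_newton(Q0, p0, Q1, p1, D, c,
                              np.array([1.5]), np.array([0.7]), np.array([0.0]))
print(y0 @ D[0] @ y1 - c[0])   # constraint residual after the iterations
```

As the example in the text notes, the outcome depends on the starting point; the sketch uses a fixed number of iterations and makes no attempt at step control or convergence testing.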

Example 3.2: A type A system with $N_0 = N_1 = 22$,
$\omega_{p0} = \omega_{s1} = 0.4\pi$, and $\omega_{p1} = \omega_{s0} = 0.6\pi$ was designed.

For such an equal-length system, our computer experiments
showed that the JMSE filters [17] served as good
candidates for the initial approximations. These filters
were designed by approximating the ideal brick-wall
half-band filters using the downhill simplex method [18]. We


TABLE VI

Comparison Between the Proposed Approach and the Lattice
Approach for Example 3.1. Here $\delta_p$ and $\delta_s$ Denote the Peak-Ripple
Sizes in the Passband and Stopband, Respectively

             Lattice Approach     Proposed Approach    Lattice/Proposed
             H0        H1         H0        H1         H0      H1
 delta_p     0.0327    0.0349     0.0224    0.0230     1.46    1.51
 delta_s     0.0449    0.0267     0.0307    0.0171     1.46    1.56

Fig. 6. Magnitude response plots for the initial analysis filters in example 3.2.


Fig.  7.  Magnitude  response  plots  for  the  analysis  filters  in  example  3.2. 


simply used the computer program in [17] to obtain our
initial approximations for $y^{(1)}$, and set $\lambda^{(1)} = 0$. The
initial magnitude response plots are shown in Fig. 6. With
$\alpha_0 = 1$, $\alpha_1 = 2$, $\alpha_{s0} = 1$, and $\alpha_{s1} = 0.8$ the solution was
obtained within 8 iterations. Here, by adjusting the
weighting factors, various filter performance criteria can
be accommodated. For the chosen weighting factors, we
were able to obtain better filtering performance than [11].
The magnitude response plots of $H_0(z)$ and $H_1(z)$ are
shown in Fig. 7. The coefficients of $H_0(z)$ and $H_1(z)$ are
TABLE VII

Impulse Responses of the Optimized Analysis Filters in Example 3.2

 n    h0(n)                     h1(n)
 0     0.13214141754298D-02      0.24489011491443D-03
 1    -0.13588732555019D-01     -0.25183219151237D-02
 2     0.16361001428450D-01      0.24872587441326D-02
 3     0.14471935841032D-01      0.82847670444218D-02
 4    -0.33083436288235D-01     -0.11927299751857D-01
 5    -0.10335603320186D-01     -0.17648198575570D-01
 6     0.60231455220816D-01      0.33963707314799D-01
 7    -0.93183731576413D-02      0.39646017249692D-01
 8    -0.11708023337247D-00     -0.90292313119252D-01
 9     0.10149306240288D-00     -0.13843420538781D-00
10     0.48371640962495D-00      0.46073324704670D-00

TABLE VIII

Comparison Between the Proposed Approach and the Lattice
Approach for Example 3.2. Here $\delta_p$ and $\delta_s$ Denote the Peak-Ripple
Sizes in the Passband and Stopband, Respectively

             Lattice Approach     Proposed Approach    Lattice/Proposed
             H0        H1         H0        H1         H0      H1
 delta_p     0.0246    0.0260     0.0133    0.0252     1.85    1.03
 delta_s     0.0592    0.0307     0.0400    0.0234     1.48    1.31

TABLE IX

A Ramp Input Sequence $x(n)$ and the Reconstructed Signal $\hat{x}(n)$ for
Example 3.1 and Example 3.2

                            Example 3.1          Example 3.2
 n    x(n)                  x^(n + 23)           x^(n + 21)
 0     1.0000000000000       1.0000000000000      1.0000000000000
 1     2.0000000000000       2.0000000000000      2.0000000000000
 2     3.0000000000000       3.0000000000000      3.0000000000000
 3     4.0000000000000       4.0000000000000      4.0000000000000
 4     5.0000000000000       5.0000000000000      5.0000000000000
 5     6.0000000000000       6.0000000000000      6.0000000000000
 6     7.0000000000000       7.0000000000000      7.0000000000000
 7     8.0000000000000       8.0000000000000      8.0000000000000
 8     9.0000000000000       9.0000000000000      9.0000000000000
 9    10.0000000000000      10.0000000000000     10.0000000000000

shown in Table VII. The comparison of peak ripples with
[11] is summarized in Table VIII.

In order to demonstrate the perfect-reconstruction property
of the proposed approach, a computer simulation with
double-precision arithmetic was run for a simple ramp
input sequence $x(n)$. The reconstructed signals $\hat{x}(n)$ for
examples 3.1 and 3.2 are shown in Table IX.

IV.  Conclusions 

We have presented two new approaches to the design
of two-channel perfect-reconstruction linear-phase FIR
filter banks. Using these Lagrange multiplier approaches,
we have been able to formulate the design problem first
as a quadratic programming problem with linear constraints,
and then as one with nonlinear constraints.
Closed-form solutions for the first approach, and for the
iterative problem in the second approach, have been
derived. Several design examples have been given to show
the effectiveness of the proposed approaches. When compared
to other results recently reported, the proposed
approaches appear to have better filtering performance. One
further observation about the first approach is that, when
the optimal infinite-precision impulse response of $h_0(n)$ is
rounded to the nearest power-of-two coefficients, we can
still obtain the impulse response of $h_1(n)$ by using (6),
and we thus obtain a perfect-reconstruction system with
low-complexity $H_0(z)$.
The Design of Low-Complexity Linear-Phase
FIR Filter Banks Using Powers-of-Two
Coefficients with an Application
to Subband Image Coding

Bor-Rong Horng, Member, IEEE, Henry Samueli, Member, IEEE, and Alan N. Willson, Jr., Fellow, IEEE


Abstract— An  optimization  technique  is  presented  for  the 
design  of  multiplierless  two-channel  linear-phase  finite-duration 
impulse-response  (FIR)  filter  banks.  It  is  shown  to  yield  filter 
banks  with  good  filtering  performance  and  nearly  perfect  signal 
reconstruction.  The  design  employs  filters  whose  coefficients  are 
represented  by  a  canonic  signed-digit  (CSD)  code.  When  applied 
to  subband  image  coding  this  technique  provides  an  easy  way  to 
design  low-complexity  analysis/synthesis  filter  banks  for  high- 
performance  codecs.  Examples  concerning  filter  design  and  the 
application  of  such  filters  to  subband  image  coding  are  given. 


I.  Introduction 

SUBBAND  image  coding  has  recently  been  shown  to  be 
an effective technique for image compression [1]-[4].
Although this technique can yield high-quality coding systems
at low bit rates, it generally requires the implementation
of sophisticated analysis/synthesis filter banks, which
increases system complexity. The filter bank's operation
requires numerous multiplications and additions. Multiplication,
in particular, is extremely time consuming. With
current  advanced  very-large-scale  integration  (VLSI)  tech¬ 
nologies,  fast  multipliers  (operating  at  speeds  exceeding  100 
MHz  [5])  are  available.  Such  multipliers  employ  highly 
parallel  processing,  which  requires  a  large  chip  area.  If  filter 
banks  were  employed  in  high-speed  applications  such  as 
real-time  image  compression  systems,  a  separate  fast  multi¬ 
plier  would  probably  be  required  for  each  filter  coefficient, 
which  would  surely  be  unacceptable  from  a  hardware-com¬ 
plexity  point  of  view.  However,  if  a  multiplication  operation 
could  be  replaced  by  only  a  few  additions  or  subtractions 
then  the  complexity  of  the  entire  analysis/synthesis  filter 
bank  would  be  reduced  quite  dramatically  to  a  point  where  its 
implementation  in  a  fast  real-time  system  becomes  feasible. 
In  this  paper  we  show  how  such  a  goal  can  be  achieved.  We 
employ  a  discrete  coefficient  optimization  technique  to  design 
two-channel linear-phase finite-duration impulse-response
(FIR) filter banks that exhibit good filtering performance and
nearly perfect signal reconstruction while requiring far less
hardware complexity than conventional filter
banks using floating-point coefficients [6]-[8].

We represent our filter coefficients by a radix-2 canonic
signed-digit (CSD) code [9]. By adding the flexibility of
negative digits to a conventional binary code, the radix-2
signed-digit representation of a fractional number $c$ is given
by

$$c = \sum_{k=1}^{L} s_k 2^{-p_k} \qquad (1)$$

where $s_k \in \{-1, 0, 1\}$ and $p_k \in \{0, 1, \ldots, M\}$. The number
representation specified by (1) has $M + 1$ total (ternary)
digits and $L$ nonzero digits. The number of adders/subtractors
required to realize such a coefficient is $L - 1$, one less
than the number of nonzero digits. In general there are
several  signed-digit  representations  for  a  given  number.  The 
CSD  code  is  that  representation  with  the  minimum  number  of 
nonzero  digits  and  for  which  no  two  nonzero  digits  sk  are 
adjacent.  A  well-known  feature  of  the  CSD  code  is  its  ability 
to  represent  most  numbers  with  many  fewer  nonzero  digits. 
For example, the 8-bit two's complement representation of
127/128 = 0.9921875 has seven nonzero digits (0.1111111)
whereas the eight-digit radix-2 CSD representation of the
same number has only two nonzero ternary digits, as given by
$1.000000\bar{1}$, where $\bar{1}$ denotes $-1$. Thus, only a single subtractor
would be required to implement a multiplier with a
coefficient  having  this  value.  It  is  this  feature  of  the  CSD 
code  that  makes  it  possible  to  design  low-complexity  high- 
performance  filter  banks  suitable  for  single-chip  VLSI  imple¬ 
mentations. 
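
As a rough illustration of the digit-count saving described above, the following Python sketch converts an integer numerator into canonic signed-digit form. The function name `csd_digits` and the least-significant-digit-first output order are assumptions made for this example; they do not come from the paper.

```python
def csd_digits(n):
    """Return the CSD (non-adjacent form) digits of integer n, least significant first.

    Each digit is -1, 0, or 1, and no two adjacent digits are both nonzero.
    """
    digits = []
    while n != 0:
        if n % 2:
            # n mod 4 == 1 gives digit +1; n mod 4 == 3 gives digit -1.
            d = 2 - (n % 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits


# 127/128 has numerator 127 with seven fractional bits.
digits = csd_digits(127)
nonzero = sum(1 for d in digits if d != 0)
print(digits, nonzero)
```

Read most significant digit first, the result for 127 is the pattern $1.000000\bar{1}$ quoted in the text: two nonzero digits, and hence $L - 1 = 1$ adder/subtractor.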

II.  Two-Channel  Filter  Banks 

Subband  image  coding  involves  the  design  of  two-dimen¬ 
sional  filter  banks.  In  the  present  work  we  restrict  our 
attention  to  the  simple  case  of  a  two-channel  system  in  one 
dimension.  Such  filter  banks  can  be  cascaded  in  a  tree 
structure  to  provide  an  arbitrarily  fine  division  of  the  signal, 
in  frequency,  and  can  be  directly  applied  to  two-dimensional 
subband image coding systems by using separable filter banks
[10], which first perform the filtering on the rows and then on
the  columns  of  an  image. 

Fig. 1. Two-channel analysis/synthesis filter bank system.

A generic two-channel FIR filter bank is shown in Fig. 1.
Here $H_0(z)$ and $H_1(z)$ represent the lowpass and highpass
filters, respectively, in the analysis bank, and $G_0(z)$ and
$G_1(z)$ are the synthesis filters. Assuming perfect channels
and codecs, it is well known that the reconstructed signal
$\hat{x}(n)$ can be related to the input signal $x(n)$ by

$$\hat{X}(z) = \tfrac{1}{2}[H_0(z)G_0(z) + H_1(z)G_1(z)]X(z) + \tfrac{1}{2}[H_0(-z)G_0(z) + H_1(-z)G_1(z)]X(-z).$$

Furthermore, by choosing

$$G_0(z) = 2H_1(-z)$$

$$G_1(z) = -2H_0(-z)$$

we have

$$\hat{X}(z) = [H_0(z)H_1(-z) - H_0(-z)H_1(z)]X(z). \qquad (2)$$

If we impose the following pure-delay constraint

$$H_0(z)H_1(-z) - H_1(z)H_0(-z) = z^{-2k+1} \qquad (3)$$

then

$$\hat{X}(z) = z^{-2k+1}X(z).$$

Thus, we obtain a perfect reconstruction system where the
output $\hat{x}(n)$ is a delayed replica of the input $x(n)$. In some
applications linear-phase filters are preferred over nonlinear-phase
filters for image coding [11]. As described in [8],
by combining the linear-phase constraint with the pure-delay
constraint there are, in total, 16 possible types of
$H_0(z)$, $H_1(z)$ pairs to consider, only two of which yield
nontrivial analysis filters:

1) Both filters have even length and opposite symmetry,
denoted in [8] as type A systems.

2) Both filters have odd length and are symmetric, denoted
in [8] as type B systems.

Using terminology defined in [12] there are four cases of
linear-phase FIR filters, depending on whether the filter
length is odd or even and whether the impulse response is
symmetrical or antisymmetrical:

Case 1) The impulse response is symmetrical and the filter
length is odd.

Case 2) The impulse response is symmetrical and the filter
length is even.

Case 3) The impulse response is antisymmetrical and the
filter length is odd.

Case 4) The impulse response is antisymmetrical and the
filter length is even.

Then, for type A systems the analysis filters are either case
2 or case 4 since the lengths are even. It can easily be shown
that case 2 requires $H_1(e^{j\pi}) = 0$ and thus it cannot realize a
highpass filter. Similarly, case 4 cannot realize a lowpass
filter. Therefore $H_1(z)$ must be case 4 and $H_0(z)$ must be
case 2.

For type B systems, it is obvious that both $H_0(z)$ and
$H_1(z)$ must be case 1. Furthermore, by examining the pure-delay
constraint (3), and by considering the coefficient
symmetry/antisymmetry of linear-phase filters, we can make the
following observations (assuming the lengths of $h_0(n)$ and
$h_1(n)$ to be $N_0$ and $N_1$, respectively):

1) The sum of the lengths must be a multiple of 4 [8].

2) The number of independent constraint equations in (3)
reduces to $(N_0 + N_1)/4$.

3) The constraint equations can be expressed as

$$\delta\!\left(i - \frac{N_0 + N_1}{4}\right) = \sum_{k=0}^{2i-1} (-1)^k h_0(2i - 1 - k)\, h_1(k), \qquad i = 1, 2, \ldots, \frac{N_0 + N_1}{4}. \qquad (4)$$

It should be noted here that for type B systems $h_0(n) = 0$
for $n \geq N_0$ and $h_1(n) = 0$ for $n \geq N_1$. By adding
the coefficient symmetry/antisymmetry of the linear-phase
filters, (4) can be expressed in the matrix form

$$C y_1 = m \qquad (5)$$

where $y_1$ is an $(l_1 + 1)$-dimensional column vector

$$y_1 = [h_1(0)\ h_1(1)\ \cdots\ h_1(l_1)]^T$$

$m$ is an $(N_0 + N_1)/4$-dimensional column vector

$$m = [0\ \cdots\ 0\ 1]^T$$

$C$ is an $(N_0 + N_1)/4$-by-$(l_1 + 1)$ matrix with the elements
formed by $h_0(n)$, $n = 0, 1, \ldots, l_0$, and

$$l_0 = \frac{N_0}{2} - 1, \qquad l_1 = \frac{N_1}{2} - 1 \qquad \text{(type A)}.$$
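
The pure-delay constraint (3) can be checked numerically for any candidate filter pair. The sketch below forms the coefficients of $H_0(z)H_1(-z) - H_1(z)H_0(-z)$ by polynomial multiplication. The helper names `modulate` and `distortion_function`, and the short 5-tap/3-tap type B pair used as input, are illustrative assumptions and are not filters from the paper.

```python
import numpy as np


def modulate(h):
    """Return the coefficients of H(-z): every odd-indexed tap is negated."""
    h = np.asarray(h, dtype=float)
    return h * (-1.0) ** np.arange(len(h))


def distortion_function(h0, h1):
    """Coefficients of T(z) = H0(z)H1(-z) - H1(z)H0(-z), in powers of z^-1."""
    return np.convolve(h0, modulate(h1)) - np.convolve(h1, modulate(h0))


# Illustrative type B pair (odd lengths 5 and 3, both symmetric); not from the paper.
h0 = np.array([-1.0, 2.0, 6.0, 2.0, -1.0]) / 8.0
h1 = np.array([1.0, -2.0, 1.0]) / 4.0
t = distortion_function(h0, h1)
print(t)
```

For this pair the only nonzero coefficient of $T(z)$ is a unit tap at $z^{-3}$, i.e. $z^{-2k+1}$ with $k = 2$, and the lengths sum to 8, a multiple of 4 as observation 1) requires.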


These constraint equations are nonlinear because both $h_0(n)$
and $h_1(n)$ are involved. However, if the lowpass impulse
response $h_0(n)$ is given, they then become linear. It has been
pointed out in [13] that if only the coefficients of the lowpass
filter $H_0(z)$ are restricted to CSD coefficients then good
perfect-reconstruction systems can be obtained. The design
procedure is given as follows:

1) The lowpass impulse response $h_0(n)$ is obtained by
using the Lagrange-Newton method in [6]. It is rounded
to the nearest CSD code, with the number of nonzero
digits as low as possible. This assures the low complexity
of the lowpass filter.

2) Depending on whether or not the filter lengths are
equal, the highpass filter $H_1(z)$ is obtained by:

(a) The Unequal-Length Case: The rounded lowpass
impulse response is used in the Lagrange multiplier
method in [6] to obtain the highpass impulse
response.

(b) The Equal-Length Case: The rounded lowpass impulse
response is used to obtain the $C$ in (5), and
then the highpass impulse response is obtained by

$$y_1 = C^{-1}m. \qquad (6)$$

The perfect-reconstruction filter banks so obtained would
require multipliers only for the implementation of $H_1(z)$
because each coefficient of $H_0(z)$ could be implemented by
using  only  a  few  adders  or  subtractors.  We  then  need  only 
concentrate  on  searching  for  a  suitable  set  of  highpass  CSD 
coefficients  to  achieve  a  totally  multiplierless  design.  This 
procedure  serves  to  provide  a  good  starting  point  for  the  CSD 
search  technique  described  in  the  next  section. 

III.  The  Optimization  Algorithm 

The main core of our proposed discrete optimization algorithm
is the two-stage local search strategy recently reported
in [14]:

Stage 1) We search for the optimal scale factor, given $L$
and $M$ in (1), and we assign one more nonzero
digit to those coefficients whose magnitude exceeds
0.5, such that an appropriate objective function
is minimized.

Stage  2)  We  use  a  bivariate  local  search  technique  [15]  to 
find  the  best  set  of  CSD  coefficients,  in  the 
neighborhood  of  the  scaled  and  rounded  coeffi¬ 
cients,  which  minimizes  the  objective  function. 

This  two-stage  local  search  strategy  has  been  shown  to  be 
very efficient for finding a nearly optimal set of CSD coefficients
in unconstrained FIR filter design [14]. Therefore we
adopt  this  strategy  and  modify  it  to  fit  the  needs  of  our 
filter-bank  design  problem. 

When  the  filter  bank  coefficients  are  restricted  to  a  rela¬ 
tively  sparse  set  of  coefficients  such  as  the  CSD  coefficients, 
the  constraint  equations  (4),  which  embody  the  perfect-recon¬ 
struction  property,  generally  will  not  be  satisfied.  Therefore, 
we  must  establish  an  objective  function  that  will  yield  good 
filtering  performance  while  adhering  to  (4)  as  closely  as 
possible.  A  reasonable  objective  function  would  be  a  joint 
weighted  function  of  these  two  requirements.  However,  our 
computer  simulations  have  shown  that  the  constraint  imposed 
by  (4)  is  considerably  more  dominant  than  that  of  the  filtering 
requirement.  Thus,  the  objective  function  to  be  minimized  is 
chosen  as 

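$$\max_{\omega} \Big|\, \big| H_0(e^{j\omega}) H_1(-e^{j\omega}) - H_0(-e^{j\omega}) H_1(e^{j\omega}) \big| - 1.0 \,\Big| \qquad (7)$$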

which  is  the  peak  ripple  of  the  signal-reconstruction  error. 
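
A minimal sketch of how such a peak-ripple measure could be evaluated on a frequency grid is given below. It assumes the objective is the largest deviation of the overall system magnitude response from unity, with the system function taken from (2); the function name `reconstruction_ripple`, the grid size, and the short test filters are hypothetical choices for this example.

```python
import numpy as np


def reconstruction_ripple(h0, h1, n_freq=1024):
    """Peak of | |T(e^{jw})| - 1 | on a uniform frequency grid.

    T(z) = H0(z)H1(-z) - H0(-z)H1(z), as in (2).
    """
    h0 = np.asarray(h0, dtype=float)
    h1 = np.asarray(h1, dtype=float)
    sign0 = (-1.0) ** np.arange(len(h0))   # taps of H0(-z) are h0 * sign0
    sign1 = (-1.0) ** np.arange(len(h1))   # taps of H1(-z) are h1 * sign1
    t = np.convolve(h0, h1 * sign1) - np.convolve(h0 * sign0, h1)
    magnitude = np.abs(np.fft.rfft(t, n_freq))
    return float(np.max(np.abs(magnitude - 1.0)))


# Illustrative pair (not from the paper): exact, then with one tap perturbed.
h0 = np.array([-1.0, 2.0, 6.0, 2.0, -1.0]) / 8.0
h1_exact = np.array([1.0, -2.0, 1.0]) / 4.0
h1_perturbed = h1_exact + np.array([0.0, 0.0, -1e-4])
ripple_exact = reconstruction_ripple(h0, h1_exact)
ripple_perturbed = reconstruction_ripple(h0, h1_perturbed)
print(ripple_exact, ripple_perturbed)
```

The exact pair gives a ripple at the level of floating-point rounding, while perturbing a single coefficient makes the ripple nonzero, which is the effect the CSD search tries to keep small.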


Our  proposed  CSD  optimization  algorithm  is  described  as 
follows: 

1) We employ the coefficients obtained by using the procedure
in Section II as the starting point.

2) The two-stage local search strategy is then adopted to
search for the optimal set of CSD coefficients for $H_1(z)$
such that (7) is minimized.

3) Finally, a scale factor (SF) is needed to scale the signal
level back such that the overall system transfer function
is close to 1. This scale factor, which is also rounded to
the nearest CSD code, can be inserted right after the
output of $H_1(z)$ and $G_0(z)$.

Two  design  examples  are  now  given  to  illustrate  the 
proposed  technique. 

Example 1: An equal-length case for type A systems is
designed, where both $H_0(z)$ and $H_1(z)$ have 22 taps. The
infinite-precision coefficients resulting from the
Lagrange-Newton method in [6] are used as the starting
point. $L = 2$ and $M = 16$ are chosen for $H_0(z)$, and $L = 4$
and $M = 16$ are chosen for $H_1(z)$. The resulting CSD
coefficients are shown in Table I. The total number of
adders/subtractors, including the scale factor, to implement
the entire analysis filter bank is 85. Recently there have been
several high-speed VLSI single-chip implementations of CSD
FIR filters [16], [17]. In [17] a 64-tap CSD FIR linear-phase
filter, working at video rate, has been implemented on a
single chip. These results show that it is feasible to implement
our CSD filter bank on a single chip using modern
VLSI technology since our filter bank is less
complex than the 64-tap filter.

Fig.  2  shows  the  magnitude  response  plots  of  the  infinite- 
precision  optimal  system  and  the  CSD  optimal  system.  As  we 
can  see,  the  filtering  performance  of  the  CSD  design  is 
almost  as  good  as  that  of  the  infinite-precision  design.  The 
hardware  complexity,  however,  has  been  reduced  signifi¬ 
cantly.  The  price  paid  for  the  CSD  design  is  the  loss  of 
perfect  signal  reconstruction.  The  overall  system  magnitude 
response  of  the  CSD  design  is  shown  in  Fig.  3.  Here  we  see 
that  the  reconstruction  error  of  the  CSD  design  is  less  than 
0.00026  dB.  Such  an  extremely  small  reconstruction  error  is 
believed  to  be  negligible  in  practice.  This  belief  is  also 
supported  by  the  subband  image  coding  experiment  described 
in  the  next  section. 

Example 2: An unequal-length case for type B systems is
designed. Here, $H_0(z)$ has 23 taps and $H_1(z)$ has 25 taps.
Again, the starting point is obtained from the use of the
Lagrange-Newton method [6]. Again, $L = 2$ and $M = 16$
are chosen for $H_0(z)$, and $L = 4$ and $M = 16$ are chosen
for $H_1(z)$. Fig. 4 shows the resulting magnitude response
plots of the infinite-precision optimal system and the CSD
optimal system. Fig. 5 shows the overall system magnitude
response of the CSD design. Again, we observe that the
filtering performance of the CSD design is almost as good as
that of the infinite-precision design. The reconstruction error
is less than 0.00013 dB, which we also believe to be negligible
in practice. The resulting CSD coefficients are shown in
Table II. The total number of adders/subtractors, including
the scale factor, for implementing the entire analysis filter
bank is 88.

TABLE I

CSD Coefficients for Example 1

IV.  Application  to  Subband  Image  Coding 

We  wish  to  compare  the  performance  of  a  multiplierless 
filter  bank  with  other  filter  banks,  when  applied  to  subband 
image  coding.  The  subband  coding  system  used  here  was 
developed  by  Darragh  and  Baker  [3],  [4].  This  system  em¬ 
ployed  enumerative  Laplacian  quantization,  which  is  made  up 
of  a  scalar  uniform  threshold  quantizer  in  cascade  with  an 
entropy  encoder  specifically  tailored  to  the  quantizer  output 
statistics,  for  nonbaseband  subbands,  and  differential  pulse 
code  modulation  for  baseband.  Methods  for  allocating  the 
rate  among  subbands  predicated  on  subband  quantizers  were 
then established. The fixed-distortion subband coding algorithm
(FDSBC) [3] solves the problem of minimizing the
total bit rate subject to a constraint on allowable mean-square
distortion, whereas the fixed-rate subband coding algorithm
(FRSBC) [4] minimizes the mean-square error in the reconstructed
image for a prescribed total bit rate. An original
256 x 256 pixel image, represented by 8 bits/pixel, was encoded
using the FDSBC algorithm, targeted at 33.36 dB, and
the FRSBC algorithm, targeted at 0.5 bits/pixel, respectively.
The filter banks tested are the 22-tap CSD filter bank
(CSD-22) in Example 1, the 22-tap infinite-precision filter
bank using the Lagrange-Newton method (Lagrange-22) in
[6], and the well-known 32-tap quadrature mirror filter bank
(QMF 32D), designated 32D in [7]. A two-level hierarchical
structure, formed by the basic four-band equal-split structure
as shown in Fig. 6, is used, yielding a total of 16 subbands.
The rates, in bits per pixel (bpp), and the peak signal-to-noise
ratios (PSNR's) are summarized in Table III.

Fig. 3. System magnitude response plot of the CSD design in Example 1.
(Horizontal axis: frequency in cycles/sample.)

Fig. 4. Magnitude response plots of the CSD design and the infinite-
precision design in Example 2. (Horizontal axis: frequency in cycles/sample.)

Fig. 5. System magnitude response plot of the CSD design in Example 2.

TABLE II

CSD Coefficients for Example 2

Fig. 6. The basic four-band equal-split analysis structure for subband
image coding. LP_h: lowpass filtering in the horizontal direction. HP_h:
highpass filtering in the horizontal direction. LP_v: lowpass filtering in the
vertical direction. HP_v: highpass filtering in the vertical direction.

Fig. 7. Original image. (For color supplement see p. 392.)

Fig. 8. Reconstructed image using the CSD-22 filter bank and FDSBC
algorithm, resulting in PSNR = 34.36 dB and bit rate of 0.807 bpp. (For
color supplement see p. 392.)

TABLE III

PSNR and Rate of the Tested Filter Banks

                 PSNR (dB)           Rate (bpp)
Filter           FDSBC    FRSBC      FDSBC    FRSBC
CSD-22           34.36    30.96      0.807    0.447
Lagrange-22      34.39    30.98      0.808    0.448
QMF 32D          34.42    30.77      0.820    0.447

Here the PSNR is related to the mean-square error $d$ by

$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{255^2}{d}\right)\ \text{dB}.$$
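
The PSNR relation above is easy to evaluate directly. The sketch below computes it in both directions; the function names `psnr_db` and `mse_from_psnr` are assumptions made for this example, and the peak value of 255 corresponds to the 8 bits/pixel images used in the experiment.

```python
import math


def psnr_db(mse, peak=255.0):
    """PSNR in dB for mean-square error `mse` and peak pixel value `peak`."""
    if mse <= 0:
        raise ValueError("mean-square error must be positive")
    return 10.0 * math.log10(peak * peak / mse)


def mse_from_psnr(psnr, peak=255.0):
    """Invert the relation: mean-square error for a PSNR given in dB."""
    return peak * peak / (10.0 ** (psnr / 10.0))


print(psnr_db(650.25))          # error energy 100 times below the squared peak
print(mse_from_psnr(34.36))     # mean-square error behind the CSD-22 FDSBC entry
```

Under this relation the 34.36 dB reported for CSD-22 corresponds to a mean-square error of about 23.8 on the 0 to 255 pixel scale.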


Fig. 9. Reconstructed image using the Lagrange-22 filter bank and FDSBC
algorithm, resulting in PSNR = 34.39 dB and bit rate of 0.808 bpp. (For
color supplement see p. 392.)


Fig.  10.  Reconstructed  image  using  QMF  32D  filter  bank  and  FDSBC 
algorithm,  resulting  in  PSNR  =  34.42  dB  and  bit  rate  of  0.820  bpp.  (For 
color  supplement  see  p.  392.) 

The  original  image  and  the  reconstructed  images  using 
FDSBC  for  CSD-22,  Lagrange-22,  and  QMF  32D  are  shown 
in Figs. 7, 8, 9 and 10, respectively. We notice that the
bit rates, PSNR's and subjective reconstructed-image quality
achieved by CSD-22 are almost the same as those of
Lagrange-22. This confirms that the extremely small signal-
reconstruction error resulting from the CSD design is negligible
in practice. Also, it is evident that the performance of
CSD-22 is comparable to that of QMF 32D. The complexity
of CSD-22, however, is much lower than that of the others. The
Lagrange-22 filter bank requires 22 multipliers and 42 adders
to implement the analysis filter bank, and the QMF 32D filter
bank requires 16 multipliers and 32 adders/subtractors. Our
CSD-22 filter bank, as described in Example 1, needs a total
of only 83 adders/subtractors. The results here show that the
proposed  design  technique  provides  an  easy  way  to  design 
low-complexity  analysis/synthesis  filter  banks  for  high-per¬ 
formance  subband  image  codecs. 

V.  Conclusions 

A search technique has been presented for the design of
multiplierless two-channel linear-phase FIR filter banks. The
filter coefficients are represented by a CSD code, which
makes it feasible to build the entire filter bank on a single
chip using modern VLSI technology. Good filtering performance
and nearly perfect signal reconstruction have been
demonstrated through design examples. When applied to
subband image coding the proposed design technique has
yielded comparable coding performance and much lower design
complexity, compared with infinite-precision
design techniques. This feature of high performance with low
design complexity should help make the recently popular
subband image coding technique even more attractive.

1 Abstract

For  high-speed  communications  applications,  most  of  the  filtering  requires  narrow  pass- 
band  filters,  which  means  that  long  FIR  (Finite  Impulse  Response)  digital  filters  are  needed.  It  is 
well-known  that  one  of  the  disadvantages  of  FIR  filters  is  their  high  computational  complexity.  In 
order  to  reduce  the  number  of  adders  and  multipliers  required,  an  attractive  alternative  for  realizing 
the  narrow  band  filters  is  to  use  a  structure  composed  of  a  cascade  of  an  RRS  (Recursive  Running 
Sum)  prefilter  and  a  corresponding  magnitude  response  equalizer  [1,2].  This  report  presents  a  silicon 
compiler  for  digital  FIR  RRS  prefilter  integrated  circuits  designed  in  the  Mentor  Graphics  GDT 
CAD  environment.  The  design  goals,  in  decreasing  order  of  importance,  for  this  RRS  prefilter  are: 
high  speed,  small  area,  and  low  power  dissipation.  By  using  carry-save  arithmetic  in  the  hardware 
implementation, the critical path of the RRS prefilter is made independent of the data word length,
which  in  turn  means  that  the  data  word  length  does  not  affect  the  prefilter’s  speed.  The  critical  path 
is  composed  of  only  two  adders  and  a  multiplexer.  One  noteworthy  point  is  that  the  total  number  of 
adders  required  is  independent  of  the  prefilter  order  as  a  result  of  rewriting  the  transfer  function.  The 
prefilter  is  capable  of  implementing  both  the  lowpass  and  highpass  functions.  Several  prefilters  can 
be  cascaded  in  series  to  enhance  the  performance.  A  prototype  chip  has  been  generated  from  the 
compiler. It has been tested and found fully functional, and it is expected to achieve a throughput rate of
about 175 MHz in a 1.2-µm CMOS process. The die size of the prototype chip is 4.0 mm x 3.1 mm
(with pads).

2 Design Methodology

In order to fully realize the advantage of a prefilter/equalizer structure, the prefilter must be
able to operate at the same high speed as the equalizer. While the equalizer FIR filter can be
implemented simply with pipelining, the RRS prefilter has a recursive loop and requires a
programmable delay line, which makes its implementation more difficult than that of the equalizer.
Since the equalizer is able to operate at 175 MHz, our target speed for the prefilter should be just as
high, i.e., 175 MHz. Hence, a full-custom datapath design has to be used to meet the high-speed
requirement. In addition, power and area can also be optimized simultaneously. Therefore, a
full-custom cell library is created in the Led layout editor tool for the leaf cells.

A top-down approach was used in our design. First, we decided the function and
specifications of the chip and the best architecture to meet the requirements. We then examined
each functional block and determined the leaf cells required. The leaf cells were manually laid
out in Led. This allows better control of the critical delay path, area compaction and transistor sizing.
After each cell was checked with the on-line GDT LRC (Layout Rule Checker) to make sure that it
was free of design rule errors, its netlist was extracted and Lsim, a functional simulator, was used to
check the cell's functional behavior. The cell was then optimized for timing with Hspice, a circuit
simulator. With all the leaf cells ready, several Lx generators were written to produce the different
functional blocks one at a time, with input parameters such as the input/output data word length, the
width of the power supply/ground bus, and the maximum programmable delay value. These
functional blocks were checked for design rule errors and functional behavior. Also, the critical
delay path for the various blocks was simulated extensively to obtain a more accurate estimate of
the worst-case propagation delay. Finally, the various blocks were assembled with another
Lx generator, yielding the layout shown in Fig. 1. The resultant layout was checked for design rule
errors using GDT LRC. Then, it was checked with another, more thorough rule checker, Checkmate.
To verify that the connections of the various blocks are correct and that the resultant layout's functional
performance is up to our expectation, Lsim was used to simulate the different cases. A Genie
program was written to compare the Lsim simulation results with the expected results to further
ensure the chip's functionality. In order to speed up the functional simulation, M-language functional
models were written for some of the building blocks and leaf cells.


3  Introduction 


This report presents the first integrated circuit prototype implementation of a high-speed
programmable digital FIR prefilter. It is well known that one of the disadvantages of FIR digital
filters is their high computational complexity. In order to reduce the number of adders and
multipliers required, a structure using a cascade of a Recursive Running Sum (RRS) prefilter and a
corresponding magnitude response equalizer has been proposed [1,2]. Other attractive prefilter
schemes, such as prefilters based on the Dolph-Chebyshev function [3] and cyclotomic polynomials
[4], have subsequently appeared. We have set high speed, small area, and low power dissipation as
the design goals of our programmable prototype chip. The RRS structure was chosen for
implementation. The IC was fabricated by MOSIS using 1.2-µm N-well technology. A
photomicrograph of the prototype chip is shown in Fig. 1.

The basic structure of a lowpass RRS prefilter [1] with impulse response of length $L$ is
shown in Fig. 2. Its transfer function is:

$$H(z) = \frac{1}{L} \cdot \frac{1 - z^{-L}}{1 - z^{-1}} \qquad (1)$$

Its implementation requires only two adders, $(L+1)$ delay elements and a scaling multiplier. The
number of adders used is independent of the prefilter order, which is an asset in creating a compact
layout for a programmable structure. The frequency response of an RRS prefilter is the same as that
of a length-$L$ rectangular time-domain window function. Therefore, the minimum stopband
attenuation that any RRS prefilter can provide is approximately 13 dB. It would seem desirable to
increase this rather modest level of stopband attenuation. In addition, the RRS prefilter's passband
rolloff needs to be compensated. Hence, in order to simultaneously increase both the passband and
stopband performance, the modified Simple Symmetric Sharpening (SSS) structure [2], as shown in
Fig. 3, is of particular interest. It, however, requires one additional precise multiplier, as will be
explained in Section 4.
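
To make the adder-count claim concrete, the following sketch implements the RRS transfer function as the recursion $y(n) = y(n-1) + x(n) - x(n-L)$ and compares it with a direct length-$L$ moving average. It is a behavioral model under assumed zero initial conditions; the function names `rrs_lowpass` and `moving_average` are hypothetical, and the sketch says nothing about the chip's word lengths or timing.

```python
def rrs_lowpass(x, L):
    """Recursive running sum: acc[n] = acc[n-1] + x[n] - x[n-L], scaled by 1/L.

    Behavioral model of H(z) = (1/L)(1 - z^-L)/(1 - z^-1) with zero initial state.
    """
    acc = 0
    out = []
    for n, sample in enumerate(x):
        delayed = x[n - L] if n >= L else 0
        acc += sample - delayed   # one subtraction and one accumulation per sample
        out.append(acc / L)
    return out


def moving_average(x, L):
    """Direct-form reference: average of the last L samples (zeros before the start)."""
    return [sum(x[max(0, n - L + 1): n + 1]) / L for n in range(len(x))]


samples = [3, 1, 4, 1, 5, 9, 2, 6]
print(rrs_lowpass(samples, 3))
print(moving_average(samples, 3))
```

The recursion uses two additions per output sample whatever the value of $L$, whereas the direct form needs $L - 1$ additions; only the length of the delay line changes with the prefilter order.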


4  Architecture 

The  factor  limiting  the  speed  (i.e.,  maximum  data  rate)  of  an  RRS  implementation  is  the 
time  required  for  the  computations  performed  in  the  recursive  loop.  The  most  commonly  used 
methods  to  increase  the  maximum  data  rate  for  digital  signal  processing  applications  are  word-level 
pipelining, retiming, and parallelism. However, none of these techniques can be carried out within
a recursive loop as this would alter the filter's transfer function. Therefore, carry-save adders (CSAs)
were  used  in  our  prototype  chip  to  enhance  its  performance  by  pushing  the  carry  propagation  chain 
out  of  the  recursive  loop,  thereby  allowing  the  carry  propagation  to  be  performed  with  a  pipelined 
adder.  A  straightforward  implementation,  shown  in  Fig.  4,  gives  the  highest  operating  speed  because 
the  recursive  loop  is  composed  of  only  one  CSA.  However,  two  pipelined  adders  are  required  in  this 
implementation,  which  consumes  a  substantial  amount  of  area  and  imposes  a  considerable  loading 
on  the  high-speed  system  clock.  Hence,  we  decided  to  sacrifice  a  small  amount  of  speed,  and  we 
implemented  the  structure  shown  in  Fig.  5  which  uses  only  one  pipelined  adder.  The  recursive  loop 
is  now  composed  of  two  CSAs. 
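
The idea of moving carry propagation out of the recursive loop can be sketched in software. In the model below, the running total is kept as a (sum, carry) pair that is updated by a 3:2 compression with no carry ripple, and a single carry-propagate addition is performed afterwards. The function names are hypothetical, and the model uses unbounded non-negative Python integers rather than the fixed-width words of the actual datapath.

```python
def carry_save_add(a, b, c):
    """3:2 compression of three non-negative integers into a (sum, carry) pair.

    Each bit position is handled independently, so no carry ripples across the word.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def accumulate(samples):
    """Accumulate samples in carry-save form; resolve the carries once at the end."""
    s, c = 0, 0
    for x in samples:
        s, c = carry_save_add(s, c, x)   # the only operation inside the loop
    return s + c                         # carry-propagate add, outside the loop


print(accumulate([5, 9, 14, 3]))
```

Because the loop body contains only bitwise operations, its delay in hardware does not grow with the word length, which is the property the text relies on; the final addition can be pipelined freely since it lies outside the feedback path.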

To  meet  a  greater  variety  of  frequency  response  requirements,  the  prototype  chip  was 
designed  with  the  capability  of  implementing  both  lowpass  and  highpass  prefilters.  Multiplexers  are 
employed  to  specify  whether  or  not  to  take  the  complement  of  the  data  in  the  recursive  loop,  thereby 
performing the simple lowpass-to-highpass transformation $z \to -z$. As a result, the prefilter's
speed is limited by two CSAs and a multiplexer, as shown in Fig. 6.

Since the two parallel branches within the shaded block of Fig. 3 should both be normalized
by an identical factor, and since the RRS branch $H_1$ has an inherent dc gain of $L_1$, we must either:
(1) use a precise programmable scaling multiplier of $1/L_1$ for $H_1$, or (2) scale up the data in the lower
delay branch by a factor of $L_1$ and then perform a normalization at the output, after the addition of
the two branches. Depending on the dynamic range requirements, the normalization in the second
approach can be either a precise scaling or an approximate (power-of-two) scaling. The latter
scheme is used in our design since a precise programmable integer multiplier of $L_1$ is easier to
implement than a precise multiplier of $1/L_1$. Output normalization is then implemented with a barrel
shifter, which is composed entirely of n-type pass-gates, resulting in a compact layout.
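
The lowpass-to-highpass transformation $z \to -z$ mentioned in this section can be illustrated with a small behavioral model. Substituting $-z$ for $z$ in the unnormalized RRS transfer function changes the recursion to $y(n) = -y(n-1) + x(n) - (-1)^L x(n-L)$. The function name `rrs_filter` and its `highpass` flag are assumptions made for this sketch; the chip selects the mode with multiplexers that complement the data in the loop, which this model does not attempt to reproduce.

```python
def rrs_filter(x, L, highpass=False):
    """Unnormalized RRS prefilter; `highpass=True` applies z -> -z.

    Lowpass:  y[n] =  y[n-1] + x[n] - x[n-L]
    Highpass: y[n] = -y[n-1] + x[n] - (-1)**L * x[n-L]
    """
    s = -1 if highpass else 1
    y_prev = 0
    out = []
    for n, sample in enumerate(x):
        delayed = x[n - L] if n >= L else 0
        y_prev = s * y_prev + sample - (s ** L) * delayed
        out.append(y_prev)
    return out


impulse = [1, 0, 0, 0, 0, 0]
print(rrs_filter(impulse, 4))                  # L ones: lowpass impulse response
print(rrs_filter(impulse, 4, highpass=True))   # alternating signs: highpass
```

The highpass impulse response is the lowpass one with every odd-indexed sample negated, which is exactly what replacing $z$ by $-z$ does in the time domain.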


5  Programmable  Implementation 

To  allow  our  prototype  chip  to  be  programmable,  several  building  blocks  need  to  be 
programmable: the programmable delay line, the programmable integer multiplier $L_1$, and the
programmable  barrel  shifter.  Additional  programmable  features  include  the  lowpass/highpass 
selection  and  a  user-specified  choice  of  implementing  a  stand-alone  RRS  prefilter  or  a  modified  SSS 
structure. 

A DRAM using 3-T cells, shown in Fig. 7, is used to implement the programmable delay
line (i.e., z^-L in Fig. 2). Since the DRAM block is accessed serially, the address decoding
scheme can be simplified by taking advantage of this characteristic. The first DRAM
column is read exactly L clock cycles after the same column is written. Therefore,
a loadable counter is an ideal element for keeping track of the number of clock cycles that have
elapsed and initiating the read signal. After the read signal has been initiated, it is propagated
through the rest of the DRAM address columns. When it reaches the last DRAM column, it is
fed back to the first DRAM column and the whole cycle restarts. In other words, the DRAM
block acts as a circular buffer. Since a whole DRAM column is accessed simultaneously
whenever the column is read or written, no row addressing is necessary. The column address
decoding circuits are simply a stage of C²MOS shift registers, in contrast to a traditional address
decoding implementation, which would require a stage of address calculation circuits followed by a
stage of address decoding circuits. The simplified address generation circuits not
only save area but also shorten the propagation delay.
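
The addressing scheme described above can be modelled behaviourally. The sketch below is an assumption-laden software model, not the circuit: one-hot lists stand in for the shift-register column selects, a countdown stands in for the loadable counter, and the read of a column is taken to occur before any write to it in the same cycle. The name `dram_delay_line` and the parameter `n_cols` are illustrative.

```python
def dram_delay_line(x, L, n_cols=32):
    """Delay x by L samples using circulating one-hot read and write tokens."""
    if not 1 <= L <= n_cols:
        raise ValueError("L must lie between 1 and n_cols")
    cols = [0] * n_cols
    write_tok = [1] + [0] * (n_cols - 1)   # write select starts at column 0
    read_tok = [0] * n_cols                # read select is idle until the counter fires
    countdown = L                          # loadable counter, loaded with L
    out = []
    for xn in x:
        if countdown == 0:
            read_tok[0] = 1                # inject the read token at the first column
            countdown = -1                 # counter has fired; the token now circulates
        elif countdown > 0:
            countdown -= 1
        # Read first, then write, so that L == n_cols still returns the old data.
        out.append(cols[read_tok.index(1)] if 1 in read_tok else 0)
        cols[write_tok.index(1)] = xn
        # Shift both tokens one column to the right, wrapping at the end.
        write_tok = write_tok[-1:] + write_tok[:-1]
        read_tok = read_tok[-1:] + read_tok[:-1]
    return out
```

No address arithmetic appears anywhere in the model: the only state besides the data is the two tokens and the counter, which mirrors the area and delay argument made in the text.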

There are three main operations for the DRAM block: precharge, read, and write. With high
speed as a crucial design goal, separate read and write bit-lines are used (in other words, dual-port
DRAM cells are used) so that write can operate independently of read or precharge, at the cost of a
rather small amount of area. Write is therefore not a constraint on the speed of the DRAM.
The only timing constraint is that precharge and read must not overlap, to prevent a short-circuit
current from flowing from power to ground, which would consume excessive power. Therefore, the
maximum rate of operation of the DRAM is determined by the total time needed to precharge the
bit-line and then to perform the read operation. Due to our high-speed requirements, a decimated
clock with half the speed of the system clock is used with the DRAM. This effectively doubles the
time available for each DRAM cycle. The only drawback of such a scheme is the need for extra circuitry for
demultiplexing the input bus and multiplexing the output bus. Since the clock has been
decimated by a factor of two, the delay provided by the DRAM itself is constrained
to be even. Hence, a stage of multiplexer circuitry is needed to determine whether an extra latch
stage needs to be bypassed, depending on whether L is odd or even. Interleaving was also
considered as an alternative to the decimated clock approach, but since it requires a complex
clocking scheme, it did not seem to be the best approach for high-speed operation. The architectural
block diagram of the DRAM is shown in Fig. 8.
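
The decimated-clock arrangement can be illustrated with a behavioural sketch. It assumes the input is demultiplexed into pairs of samples, that the half-rate memory delays whole pairs (hence an even delay), and that a single extra latch supplies the remaining sample when L is odd. The function name and the zero initial state are assumptions made for the example.

```python
def half_rate_delay(x, L):
    """Delay x by L samples using a memory that runs at half the sample rate."""
    n = len(x)
    m, odd = divmod(L, 2)                     # m half-rate cycles give 2*m samples of delay
    padded = list(x) + [0] * (n % 2)          # pad to a whole number of pairs
    # Demultiplex: two consecutive samples form one half-rate word.
    pairs = [(padded[i], padded[i + 1]) for i in range(0, len(padded), 2)]
    # Half-rate memory: delay by m words.
    delayed = ([(0, 0)] * m + pairs)[:len(pairs)]
    # Multiplex back to the full-rate stream.
    y = [s for pair in delayed for s in pair]
    if odd:
        y = [0] + y[:-1]                      # extra latch stage, bypassed when L is even
    return y[:n]
```

For example, L = 5 is realized as two half-rate memory cycles (four samples) plus the one-sample latch.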


6  Prototype  Chip 

A  prototype  chip  which  can  function  either  as  a  stand-alone  RRS  prefilter  or  as  the  shaded 
part of Fig. 3 was fabricated through MOSIS using 1.2-µm HPCMOS34 technology. The selection
of  one  of  these  two  functions  is  achieved  through  the  multiplexers  shown  in  Fig.  9.  If  a  single  RRS 
prefilter is needed, then the select signal is set to the appropriate value such that the multiplexers pick
the  branches  that  give  the  performance  of  a  stand-alone  prefilter.  Using  this  structure,  three  of  our 
prototype  chips  can  be  cascaded  to  implement  the  complete  modified  SSS  structure  of  Fig.  3. 

The  prototype  chip’s  datapath  is  shown  in  Fig.  10.  In  order  to  facilitate  the  chip’s  testing,  a 
pseudo  random  number  generator  (PRNG)  is  included  on-chip.  It  is  a  type  0  linear  feedback  shift 
register,  designed  using  the  algorithm  outlined  in  [6].  The  PRNG  also  serves  as  a  buffer  for  the  input 
data.  A  control  signal  is  present  to  select  whether  the  source  of  the  input  data  should  be  from  the 
input  data  bus  or  from  the  PRNG.  The  accumulator,  as  mentioned  in  Section  4,  is  composed  of  two 
CSAs  and  a  multiplexer.  The  multiplier  is  implemented  using  the  programmable  canonic-signed- 
digit  carry-save  scheme  described  in  [5].  Since  the  outputs  of  both  the  accumulator  and  the 
multiplier  are  in  carry-save  formats,  a  CSA  block  composed  of  two  CSAs  in  series  is  necessary.  This 
block  converts  the  4-bit  vector  to  a  2-bit  vector  so  that  the  resultant  2-bit  vector  can  be  fed  into  the 
vector-merge  adder.  The  vector-merge  adder  is  implemented  with  a  six  stage  pipelined  carry-ripple 
adder.  The  adders  used  in  the  implementation  are  transmission-gate  adders  [5].  A  summary  of  the 
prototype chip is given in Table 1.
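
The reduction performed by the CSA block and the vector-merge adder can be sketched as follows. The sketch reads the "4-bit vector" as the four operand words of the two carry-save pairs (an interpretation, not stated in the text), uses the 23-bit internal word length of Table 1, and models the carry-ripple adder bit by bit without its six pipeline stages. All function names are illustrative.

```python
W = 23                      # internal word length quoted in Table 1
MASK = (1 << W) - 1

def csa(a, b, c):
    """3:2 carry-save adder: a + b + c == s + cy (mod 2**W)."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s & MASK, cy & MASK

def four_to_two(acc_s, acc_c, mul_s, mul_c):
    """Two CSAs in series reduce two carry-save pairs to a single pair."""
    s1, c1 = csa(acc_s, acc_c, mul_s)
    return csa(s1, c1, mul_c)

def vector_merge(s, c):
    """Carry-ripple addition of the final pair (pipelining not modelled)."""
    carry, result = 0, 0
    for i in range(W):
        a, b = (s >> i) & 1, (c >> i) & 1
        result |= (a ^ b ^ carry) << i
        carry = (a & b) | (a & carry) | (b & carry)
    return result
```

Only `vector_merge` propagates carries across the word, which is why it is the one stage that benefits from pipelining.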


Table 1 Summary of the prototype chip

Technology                  1.2-µm HPCMOS34, single poly, double metal
Die size (with pads)        4.0 mm x 3.1 mm
Input word length           16 bits
Output word length          16 bits
Internal word length        23 bits
Number of pins              65
Maximum prefilter length    32
Testing results             fully functional
Yield                       100% (25 parts fabricated, 25 parts fully functional)
Maximum data rate           175 MHz (simulated)


7 Testability and Testing Results

With over 35k input vectors applied on the LV500 tester, the chip has tested fully
functional. The yield is an excellent 100% for the 25 parts fabricated. Since the target speed of the
chip is about 175 MHz, a pseudo-random number generator (PRNG) was placed on chip to assist in
high-speed testing and to achieve high fault coverage. The PRNG generates
a white noise input. By connecting the prefilter chip's output to a D/A converter and feeding the
result to a spectrum analyzer, the frequency response can be observed.
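
A linear feedback shift register of the kind used for such a PRNG can be sketched in a few lines. The on-chip register length, feedback taps, and LFSR type are not given here, so the sketch substitutes a common 16-bit Fibonacci configuration with feedback polynomial x^16 + x^14 + x^13 + x^11 + 1; this choice and the name `lfsr16` are assumptions for illustration only.

```python
def lfsr16(seed=0xACE1):
    """Yield successive states of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11)."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("the all-zero state never changes")
    while True:
        # XOR of the tapped bits forms the new most significant bit.
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state
```

With these taps the register visits every nonzero 16-bit state before repeating, so the sequence exercises the datapath with a long, noise-like stimulus from very little hardware.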

8  Conclusions 

A  silicon  compiler  for  RRS  and  SSS  digital  FIR  prefilter  integrated  circuits  has  been 
designed  in  the  Mentor  Graphics  GDT  CAD  environment.  The  design  goals  for  this  prefilter  are  high 
speed, small area, and low power. By using carry-save arithmetic in the hardware implementation,
the critical path of the RRS prefilter is made independent of the data word length, so the
word length does not affect the speed of the prefilter. To be precise, the critical
path is composed of two CSAs and one multiplexer. Three of our prefilter ICs can be cascaded to
enhance performance and implement the complete SSS prefilter of Fig. 3. A prototype chip has been
generated from the Lx-language compilers and has tested fully functional with over 35k input
vectors. The yield is 100% for the 25 parts fabricated. It is expected to achieve a throughput rate of
175 MHz (simulated) in a 1.2-µm CMOS process. The die size of our prototype chip is 4.0 mm x
3.1 mm (with pads).

Fig. 2. RRS prefilter realization.

Fig. 4. Implementation with two pipelined adders and a carry-save adder.

Fig. 7. A 3-T DRAM cell.

Fig. 8. Architectural block diagram of the DRAM.